The file ‘WeatherDownloads_202005_v002.Rmd’ contains code for downloading and processing historical weather data from the METAR archives hosted by Iowa State University.
Data have been downloaded and processed for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.
This module will perform exploratory data analysis on the processed weather files.
Each processed data file contains one year of hourly weather data for one station. Files are saved as ‘./RInputFiles/ProcessedMETAR/metar_kxxx_yyyy.rds’ where xxx is the three-letter airport code and yyyy is the four-digit year.
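Given that convention, the path for any station/year pair can be built programmatically; a minimal sketch (the metarPath helper is hypothetical, not part of the original scripts):

```r
# Hypothetical helper: build the expected .rds path for a station/year pair
# under the naming convention described above
metarPath <- function(station, year, dir="./RInputFiles/ProcessedMETAR") {
  sprintf("%s/metar_%s_%s.rds", dir, tolower(station), year)
}

metarPath("KDTW", 2016)
# "./RInputFiles/ProcessedMETAR/metar_kdtw_2016.rds"
```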
Each file contains the weather variables listed in the variable mapper defined below.
There are several functions available for analysis:
plotcountsByMetric() - bar plots for counts by variable
plotNumCor() - plot two numeric variables against each other
plotFactorNumeric() - boxplot a numeric variable against a factor variable
corMETAR() - correlations between METAR variables
lmMETAR() - linear regression modeling for METAR variables
basicWindPlots() - plot wind speed and direction
getWindDirGroup() - convert wind direction to a compass grouping (e.g., 350, 360, and 010 degrees map to N)
consolidatePlotWind() - show frequency plots of wind direction by city and month
The tidyverse library is loaded and the 2016 Detroit data are read in to demonstrate the functions:
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
kdtw_2016 <- readRDS("./RInputFiles/ProcessedMETAR/metar_kdtw_2016.rds")
A variable mapping is created to provide more readable variable names for plot titles and axes:
varMapper <- c(WindDir="Wind Direction (degrees)",
predomDir="General Prevailing Wind Direction",
WindSpeed="Wind Speed (kts)",
WindSpeed5="Wind Speed (kts), rounded to nearest 5 knots",
Visibility="Visibility (SM)",
TempC="Temperature (C)",
DewC="Dew Point (C)",
Altimeter="Altimeter (inches Hg)",
Altimeter10="Altimeter (inches Hg), rounded to nearest 0.1 inHg",
modSLP="Sea-Level Pressure (hPa)",
TempF="Temperature (F)",
DewF="Dew Point (F)",
TempF5="Temperature (F), rounded to nearest 5 degrees",
DewF5="Dew Point (F), rounded to nearest 5 degrees",
cType1="First Cloud Layer Type",
cLevel1="First Cloud Layer Height (ft)",
month="Month",
year="Year",
wType="Greatest Sky Obscuration",
day="Day of Month"
)
The function plotcountsByMetric() produces bar plots for counts by variable:
# Helper function for generating plots by key variables
plotcountsByMetric <- function(df,
mets,
title="",
rotateOn=20,
dropNA=TRUE,
diagnose=FALSE,
mapper=varMapper,
facetOn=NULL,
showCentral=FALSE
) {
# Function arguments
# df: dataframe or tibble containing raw data
# mets: character vector of variables for plotting counts
# title: character vector for plot title
# rotateOn: integer, x-axis labels will be rotated by 90 degrees if # categories >= rotateOn
# dropNA: boolean for whether to drop all NA prior to plotting (recommended for avoiding warnings)
# diagnose: boolean for whether to note in the log the number of NA observations dropped
# mapper: named character vector containing mapping from variable name to well-formatted name for titles and axes
# facetOn: a facetting variable for the supplied df (NULL for no faceting)
# showCentral: boolean for whether to show the central tendency over-plotted on the main data
# Function usage
# 1. By default, the function plots overall counts by metric for a given input
# 2. If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
# 3. If showCentral=TRUE, then the overall mean will be plotted as a point on the main plot (only makes sense if facetOn has been selected)
# Plot of counts by key metric
for (x in mets) {
# If a facetting variable is provided, need to include this in the group_by
useVars <- x
if (!is.null(facetOn)) { useVars <- c(facetOn, useVars) }
dat <- df %>%
group_by_at(vars(all_of(useVars))) %>%
summarize(n=n())
if (dropNA) {
nOrig <- nrow(dat)
sumOrig <- sum(dat$n)
dat <- dat %>%
filter_all(all_vars(!is.na(.)))
if (diagnose & (nOrig > nrow(dat))) {
cat("\nDropping",
nOrig-nrow(dat),
"rows with",
sumOrig-sum(dat$n),
"observations due to NA\n"
)
}
}
# Create the main plot
p <- dat %>%
ggplot(aes_string(x=x, y="n")) +
geom_col() +
labs(title=title,
subtitle=paste0("Counts By: ", mapper[x]),
x=paste0(x, " - ", mapper[x]),
y="Count"
)
# If the rotateOn criterion is met, rotate the x-axis labels by 90 degrees
if (nrow(dat) >= rotateOn) {
p <- p + theme(axis.text.x=element_text(angle=90))
}
# If facetting has been requested, facet by the desired variable
if (!is.null(facetOn)) {
p <- p + facet_wrap(as.formula(paste("~", facetOn)))
}
# If showCentral=TRUE, add a dot plot for the overall average
if (showCentral) {
# Get the median number of observations by facet, or the total if facetOn=NULL
if (is.null(facetOn)) {
useN <- sum(dat$n)
} else {
useN <- dat %>%
group_by_at(vars(all_of(facetOn))) %>%
summarize(n=sum(n)) %>%
pull(n) %>%
median()
}
# Get the overall percentages by x
centralData <- helperCountsByMetric(tbl=dat, ctVar=x, sumOn="n") %>%
mutate(centralValue=nPct*useN)
# Overlay the expected counts as red points
p <- p + geom_point(data=centralData, aes(y=centralValue), color="red", size=2)
}
# Print the plot
print(p)
}
}
# Example for Detroit 2016 - using WindDir, cType1, month, wType
plotcountsByMetric(kdtw_2016,
mets=c("WindDir", "cType1", "month", "wType"),
title="Detroit, MI (2016)"
)
The function plotNumCor() plots two numeric variables against one another:
# Create a function for plotting two variables against each other
plotNumCor <- function(met,
var1,
var2,
title=NULL,
subT="",
dropNA=TRUE,
diagnose=FALSE,
mapper=varMapper,
facetOn=NULL,
showCentral=FALSE
) {
# Function arguments
# met: dataframe or tibble containing raw data
# var1: character vector of variable to be used for the x-axis
# var2: character vector of variable to be used for the y-axis
# title: character vector for plot title
# subT: character vector for plot subtitle
# dropNA: boolean for whether to drop all NA prior to plotting (recommended for avoiding warnings)
# diagnose: boolean for whether to note in the log the number of NA observations dropped
# mapper: named character vector containing mapping from variable name to well-formatted name for titles and axes
# facetOn: a facetting variable for the supplied met (NULL for no faceting)
# showCentral: boolean for whether to show the central tendency over-plotted on the main data
# Function usage
# 1. By default, the function plots overall counts by the provided x/y metrics, with each point sized based on the number of observations, and with an lm smooth overlaid
# 2. If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
# 3. If showCentral=TRUE, then the lm smooth that best fits the overall data will be plotted (only makes sense if facetOn has been selected)
# Create the title if not passed
if (is.null(title)) {
title <- paste0("Hourly Observations of ", mapper[var1], " and ", mapper[var2])
}
# If a facetting variable is provided, need to include this in the group_by
useVars <- c(var1, var2)
if (!is.null(facetOn)) { useVars <- c(facetOn, useVars) }
# Pull the counts by useVars
dat <- met %>%
group_by_at(vars(all_of(useVars))) %>%
summarize(n=n())
# If NA requested to be excluded, remove anything with NA
if (dropNA) {
nOrig <- nrow(dat)
sumOrig <- sum(dat$n)
dat <- dat %>%
filter_all(all_vars(!is.na(.)))
if (diagnose) {
cat("\nDropping",
nOrig-nrow(dat),
"rows with",
sumOrig-sum(dat$n),
"observations due to NA\n"
)
}
}
p <- dat %>%
ggplot(aes_string(x=var1, y=var2)) +
geom_point(alpha=0.5, aes_string(size="n")) +
geom_smooth(method="lm", aes_string(weight="n")) +
labs(x=paste0(mapper[var1], " - ", var1),
y=paste0(mapper[var2], " - ", var2),
title=title,
subtitle=subT
)
# If facetting has been requested, facet by the desired variable
if (!is.null(facetOn)) {
p <- p + facet_wrap(as.formula(paste("~", facetOn)))
}
# If showCentral=TRUE, add a dashed line for the overall data
if (showCentral) {
p <- p + helperNumCor(dat, xVar=var1, yVar=var2, sumOn="n")
}
print(p)
}
# Example for Detroit 2016 - using TempC and TempF
plotNumCor(kdtw_2016, var1="TempC", var2="TempF", subT="Detroit, MI (2016)", diagnose=TRUE)
##
## Dropping 1 rows with 49 observations due to NA
# Example for Detroit 2016 - using TempC and DewC
plotNumCor(kdtw_2016, var1="TempC", var2="DewC", subT="Detroit, MI (2016)", diagnose=TRUE)
##
## Dropping 1 rows with 49 observations due to NA
# Example for Detroit 2016 - using Altimeter and modSLP
plotNumCor(kdtw_2016, var1="Altimeter", var2="modSLP", subT="Detroit, MI (2016)", diagnose=TRUE)
##
## Dropping 1 rows with 49 observations due to NA
The function plotFactorNumeric() creates box plots for a numeric variable against a factor variable:
# Updated function for plotting numeric by factor
plotFactorNumeric <- function(met,
fctVar,
numVar,
title=NULL,
subT="",
diagnose=TRUE,
showXLabel=TRUE,
mapper=varMapper,
facetOn=NULL,
showCentral=FALSE
) {
# Function arguments
# met: dataframe or tibble containing raw data
# fctVar: character vector of variable to be used for the x-axis (factor in the boxplot)
# numVar: character vector of variable to be used for the y-axis (numeric in the boxplot)
# title: character vector for plot title
# subT: character vector for plot subtitle
# diagnose: boolean for whether to note in the log the number of NA observations dropped
# showXLabel: boolean for whether to include the x-label (e.g., set to FALSE if using 'month')
# mapper: named character vector containing mapping from variable name to well-formatted name for titles and axes
# facetOn: a facetting variable for the supplied met (NULL for no faceting)
# showCentral: boolean for whether to show the central tendency over-plotted on the main data
# Function usage
# 1. By default, the function creates the boxplot of numVar by fctVar
# 2. If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
# 3. If showCentral=TRUE, then the overall median of numVar by fctVar will be plotted as a red dot
# Create the title if not passed
if (is.null(title)) {
title <- paste0("Hourly Observations of ", mapper[numVar], " by ", mapper[fctVar])
}
# Remove the NA variables
nOrig <- nrow(met)
dat <- met %>%
filter(!is.na(get(fctVar)), !is.na(get(numVar)))
if (diagnose) { cat("\nRemoving", nOrig-nrow(dat), "records due to NA\n") }
# Create the base plot
p <- dat %>%
ggplot(aes_string(x=fctVar, y=numVar)) +
geom_boxplot(fill="lightblue") +
labs(title=title,
subtitle=subT,
x=ifelse(showXLabel, paste0(mapper[fctVar], " - ", fctVar), ""),
y=paste0(mapper[numVar], " - ", numVar)
)
# If facetting has been requested, facet by the desired variable
if (!is.null(facetOn)) {
p <- p + facet_wrap(as.formula(paste("~", facetOn)))
}
# If showCentral=TRUE, add a dot plot for the overall average
if (showCentral) {
centData <- helperFactorNumeric(dat, .f=median, byVar=fctVar, numVar=numVar)
p <- p + geom_point(data=centData, aes(y=helpFN), size=2, color="red")
}
# Render the final plot
print(p)
}
# Example for Detroit 2016 - using TempF and month
plotFactorNumeric(kdtw_2016,
fctVar="month",
numVar="TempF",
subT="Detroit, MI (2016)",
showXLabel=FALSE,
diagnose=TRUE
)
##
## Removing 49 records due to NA
# Example for Detroit 2016 - using WindSpeed and wType
plotFactorNumeric(kdtw_2016,
fctVar="wType",
numVar="WindSpeed",
subT="Detroit, MI (2016)",
showXLabel=TRUE,
diagnose=TRUE
)
##
## Removing 49 records due to NA
# Example for Detroit 2016 - using Visibility and wType
plotFactorNumeric(kdtw_2016,
fctVar="wType",
numVar="Visibility",
subT="Detroit, MI (2016)",
showXLabel=TRUE,
diagnose=TRUE
)
##
## Removing 49 records due to NA
An issue observed previously, in which a visibility of 1/16SM was interpreted as 16 statute miles, has been corrected in the ‘WeatherDownloads_202005_v002’ file.
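The essence of that fix is treating the fractional visibility string as a quotient rather than keeping the digits after the slash. An illustrative parser (this is a sketch of the idea, not the actual code in the downloader):

```r
# Illustrative only: parse METAR visibility strings such as "1/16SM" as a
# fraction of a statute mile rather than as the integer after the slash
parseVisibility <- function(x) {
  v <- sub("SM$", "", x)                          # strip the units suffix
  parts <- as.numeric(strsplit(v, "/", fixed=TRUE)[[1]])
  if (length(parts) == 2) parts[1] / parts[2] else parts[1]
}

parseVisibility("1/16SM")   # 0.0625, not 16
parseVisibility("10SM")     # 10
```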
The function corMETAR() calculates correlations among numeric variables in the METAR data:
# Function to calculate, display, and plot variable correlations
corMETAR <- function(met, numVars, subT="") {
# Keep only complete cases and report on data kept
dfUse <- met %>%
select_at(vars(all_of(numVars))) %>%
filter(complete.cases(.))
nU <- nrow(dfUse)
nM <- nrow(met)
myPct <- round(100*nU/nM, 1)
cat("\n *** Correlations use ", nU, " complete cases (", myPct, "% of ", nM, " total) ***\n", sep="")
# Create the correlation matrix
mtxCorr <- dfUse %>%
cor()
# Print the correlations
mtxCorr %>%
round(2) %>%
print()
# Display a heat map
corrplot::corrplot(mtxCorr,
method="color",
title=paste0("Hourly Weather Correlations\n", subT),
mar=c(0, 0, 2, 0)
)
}
# Example for Detroit, MI 2016
coreNum <- c("TempC", "TempF", "DewC", "DewF",
"Altimeter", "modSLP", "WindSpeed", "Visibility"
)
corMETAR(kdtw_2016, numVars=coreNum, subT="Detroit, MI (2016) METAR")
##
## *** Correlations use 8769 complete cases (99.4% of 8818 total) ***
## TempC TempF DewC DewF Altimeter modSLP WindSpeed Visibility
## TempC 1.00 1.00 0.92 0.92 -0.17 -0.24 -0.10 0.14
## TempF 1.00 1.00 0.92 0.92 -0.17 -0.24 -0.10 0.14
## DewC 0.92 0.92 1.00 1.00 -0.22 -0.28 -0.18 -0.07
## DewF 0.92 0.92 1.00 1.00 -0.22 -0.28 -0.18 -0.07
## Altimeter -0.17 -0.17 -0.22 -0.22 1.00 1.00 -0.37 0.15
## modSLP -0.24 -0.24 -0.28 -0.28 1.00 1.00 -0.35 0.14
## WindSpeed -0.10 -0.10 -0.18 -0.18 -0.37 -0.35 1.00 0.10
## Visibility 0.14 0.14 -0.07 -0.07 0.15 0.14 0.10 1.00
The function lmMETAR() runs simple linear regression models on the METAR data:
# Function for linear regressions on METAR data
lmMETAR <- function(met,
y,
x,
yName,
subT=""
) {
# Convert to formula
myChar <- paste0(y, " ~ ", x)
cat("\n *** Regression call is:", myChar, "***\n")
# Run regression
regr <- lm(formula(myChar), data=met)
# Summarize regression
print(summary(regr))
# Predict the new values
pred <- predict(regr, newdata=met)
# Plot the predictions
p <- met %>%
select_at(vars(all_of(y))) %>%
mutate(pred=pred) %>%
filter_all(all_vars(!is.na(.))) %>%
group_by_at(vars(all_of(c(y, "pred")))) %>%
summarize(n=n()) %>%
ggplot(aes_string(x=y, y="pred")) +
geom_point(aes(size=n), alpha=0.25) +
geom_smooth(aes(weight=n), method="lm") +
labs(title=paste0("Predicted vs. Actual ", yName, " - ", x, " as Predictor"),
subtitle=subT,
x=paste0("Actual ", yName),
y=paste0("Predicted ", yName)
)
print(p)
}
# Examples for Detroit, MI 2016
lmMETAR(kdtw_2016, "modSLP", "Altimeter", yName="Sea Level Pressure", subT="Detroit, MI (2016)")
##
## *** Regression call is: modSLP ~ Altimeter ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96361 -0.44022 -0.03758 0.40772 1.41448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.44389 0.74723 -22.01 <2e-16 ***
## Altimeter 34.41514 0.02488 1383.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4961 on 8767 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.9954, Adjusted R-squared: 0.9954
## F-statistic: 1.913e+06 on 1 and 8767 DF, p-value: < 2.2e-16
lmMETAR(kdtw_2016, "modSLP", "Altimeter + TempF", yName="Sea Level Pressure", subT="Detroit, MI (2016)")
##
## *** Regression call is: modSLP ~ Altimeter + TempF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57802 -0.12674 0.00481 0.12820 0.65708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.713e+00 2.803e-01 -13.25 <2e-16 ***
## Altimeter 3.403e+01 9.302e-03 3658.80 <2e-16 ***
## TempF -2.351e-02 9.941e-05 -236.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1826 on 8766 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9994
## F-statistic: 7.086e+06 on 2 and 8766 DF, p-value: < 2.2e-16
lmMETAR(kdtw_2016, "TempC", "DewC", yName="Temperature (C)", subT="Detroit, MI (2016)")
##
## *** Regression call is: TempC ~ DewC ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.174 -3.172 -1.165 2.822 17.828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.17084 0.05373 114.9 <2e-16 ***
## DewC 0.99909 0.00470 212.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.463 on 8767 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.8375, Adjusted R-squared: 0.8375
## F-statistic: 4.519e+04 on 1 and 8767 DF, p-value: < 2.2e-16
The basicWindPlots() function creates plots for wind speed and direction:
# Generate basic wind plots
basicWindPlots <- function(met,
dirVar="WindDir",
spdVar="WindSpeed",
desc="",
gran="",
mapper=varMapper
) {
# Plot for the wind direction
wDir <- met %>%
ggplot(aes_string(x=dirVar)) +
geom_bar() +
labs(title=paste0(desc, " Wind Direction"), subtitle=gran,
y="# Hourly Observations", x=mapper[dirVar]
) +
theme(axis.text.x=element_text(angle=90))
print(wDir)
# Plot for the minimum, average, and maximum wind speed by wind direction
# Wind direction 000 is reserved for 0 KT wind, while VRB is reserved for 3-6 KT variable winds
wSpeedByDir <- met %>%
filter(!is.na(get(dirVar))) %>%
group_by_at(vars(all_of(dirVar))) %>%
summarize(minWind=min(get(spdVar)), meanWind=mean(get(spdVar)), maxWind=max(get(spdVar))) %>%
ggplot(aes_string(x=dirVar)) +
geom_point(aes(y=meanWind), color="red", size=2) +
geom_errorbar(aes(ymin=minWind, ymax=maxWind)) +
labs(title=paste0(desc, " Wind Speed (Max, Mean, Min) By Wind Direction"),
subtitle=gran,
y=mapper[spdVar],
x=mapper[dirVar]
) +
theme(axis.text.x=element_text(angle=90))
print(wSpeedByDir)
# Plot for the wind speed
pctZero <- sum(pull(met, spdVar)==0, na.rm=TRUE) / nrow(met)
wSpeed <- met %>%
filter_at(vars(all_of(spdVar)), all_vars(!is.na(.))) %>%
ggplot(aes_string(x=spdVar)) +
geom_bar(aes(y=..count../sum(..count..))) +
labs(title=paste0(round(100*pctZero), "% of wind speeds in ", desc, " measure 0 Knots"),
subtitle=gran,
y="% Hourly Observations",
x=mapper[spdVar]
)
print(wSpeed)
# Polar plot for wind speed and wind direction
wData <- met %>%
filter_at(vars(all_of(dirVar)), all_vars(!is.na(.) & !(. %in% c("000", "VRB")))) %>%
filter_at(vars(all_of(spdVar)), all_vars(!is.na(.))) %>%
mutate_at(vars(all_of(dirVar)), as.numeric) %>%
group_by_at(vars(all_of(c(dirVar, spdVar)))) %>%
summarize(n=n())
wPolarDirSpeed <- wData %>%
ggplot(aes_string(x=spdVar, y=dirVar)) +
geom_point(alpha=0.1, aes(size=n)) +
coord_polar(theta="y") +
labs(title=paste0(desc, " Direction vs. Wind Speed"),
subtitle=gran,
x=mapper[spdVar],
y=mapper[dirVar]
) +
scale_y_continuous(limits=c(0, 360), breaks=c(0, 90, 180, 270, 360)) +
scale_x_continuous(limits=c(0, 40), breaks=c(0, 5, 10, 15, 20, 25, 30, 35, 40)) +
geom_point(aes(x=0, y=0), color="red", size=2)
print(wPolarDirSpeed)
}
# Example for Detroit, MI 2016
basicWindPlots(kdtw_2016, desc="Detroit, MI (2016)", gran="KDTW METAR")
The getWindDirGroup() function maps wind direction to a compass category such as NNE. Because METAR directions are recorded in 10-degree increments (36 buckets), either 4 groupings (90 degrees each) or 12 groupings (30 degrees each) are preferred, so that each category covers the same number of underlying buckets:
# Extract the wind direction data from a processed METAR file
getWindDirGroup <- function(met, src) {
# Use the fullMETAR data and extract WindDir, WindSpeed, month
windPlotData <- met %>%
select(WindDir, WindSpeed, month) %>%
mutate(windDirGroup=factor(case_when(WindSpeed==0 ~ "No Wind",
WindDir=="VRB" ~ "Variable",
WindDir %in% c("350", "360", "010") ~ "N",
WindDir %in% c("020", "030", "040") ~ "NNE",
WindDir %in% c("050", "060", "070") ~ "ENE",
WindDir %in% c("080", "090", "100") ~ "E",
WindDir %in% c("110", "120", "130") ~ "ESE",
WindDir %in% c("140", "150", "160") ~ "SSE",
WindDir %in% c("170", "180", "190") ~ "S",
WindDir %in% c("200", "210", "220") ~ "SSW",
WindDir %in% c("230", "240", "250") ~ "WSW",
WindDir %in% c("260", "270", "280") ~ "W",
WindDir %in% c("290", "300", "310") ~ "WNW",
WindDir %in% c("320", "330", "340") ~ "NNW",
TRUE ~ "Error"
),
levels=c("No Wind", "Variable", "Error",
"N", "NNE", "ENE",
"E", "ESE", "SSE",
"S", "SSW", "WSW",
"W", "WNW", "NNW"
)
)
)
# Remove the errors and calculate percentages by month for the remainder
processedWindData <- windPlotData %>%
filter(windDirGroup != "Error") %>%
group_by(month, windDirGroup) %>%
summarize(n=n()) %>%
ungroup() %>%
group_by(month) %>%
mutate(pct=n/sum(n)) %>%
ungroup() %>%
mutate(src=src)
processedWindData
}
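Because each 30-degree group spans exactly three adjacent 10-degree buckets, the enumeration in the case_when() above can also be computed arithmetically; a compact sketch (the windGroup helper is hypothetical, and like the enumeration it ignores the calm-wind and VRB cases, which are handled separately above):

```r
# Hypothetical arithmetic equivalent of the 12-way case_when() mapping:
# shift by 10 degrees so that 350, 360, and 010 all land in the "N" bin,
# then take 30-degree slices around the compass
windGroup <- function(dirStr) {
  groups <- c("N", "NNE", "ENE", "E", "ESE", "SSE",
              "S", "SSW", "WSW", "W", "WNW", "NNW")
  deg <- as.numeric(dirStr)
  groups[((deg %% 360 + 10) %/% 30) %% 12 + 1]
}

windGroup(c("350", "040", "320"))   # "N" "NNE" "NNW"
```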
The function consolidatePlotWind() then calls getWindDirGroup() for any number of files:
# Consolidate and plot wind data
consolidatePlotWind <- function(files, names) {
consFun <- function(x, y) { getWindDirGroup(met=x, src=y) }
boundByRows <- map2_dfr(.x=files, .y=names, .f=consFun)
# Show frequency by month for each city, faceted by wind direction
p1 <- boundByRows %>%
ggplot(aes(x=month, y=pct, color=src)) +
geom_line(aes(group=src)) +
facet_wrap(~windDirGroup) +
labs(title="Wind Frequency by Month",
x="Month",
y="Frequency of Wind Observations"
) +
theme(axis.text.x=element_text(angle=90))
print(p1)
# Show frequency by wind direction for each city, faceted by month
p2 <- boundByRows %>%
ggplot(aes(x=windDirGroup, y=pct, color=src)) +
geom_line(aes(group=src)) +
facet_wrap(~month) +
labs(title="Wind Frequency by Wind Direction",
x="Wind Direction",
y="Frequency of Wind Observations"
) +
theme(axis.text.x=element_text(angle=90))
print(p2)
boundByRows
}
# Load the New Orleans and Las Vegas data for comparison
kmsy_2016 <- readRDS("./RInputFiles/ProcessedMETAR/metar_kmsy_2016.rds")
klas_2016 <- readRDS("./RInputFiles/ProcessedMETAR/metar_klas_2016.rds")
# Run wind by month comparisons for Detroit, Las Vegas, New Orleans
consolidatePlotWind(files=list(kdtw_2016, klas_2016, kmsy_2016),
names=c("Detroit, MI (2016)", "Las Vegas, NV (2016)", "New Orleans, LA (2016)")
)
## # A tibble: 504 x 5
## month windDirGroup n pct src
## <fct> <fct> <int> <dbl> <chr>
## 1 Jan No Wind 46 0.0600 Detroit, MI (2016)
## 2 Jan Variable 3 0.00391 Detroit, MI (2016)
## 3 Jan N 26 0.0339 Detroit, MI (2016)
## 4 Jan NNE 26 0.0339 Detroit, MI (2016)
## 5 Jan ENE 9 0.0117 Detroit, MI (2016)
## 6 Jan E 16 0.0209 Detroit, MI (2016)
## 7 Jan ESE 14 0.0183 Detroit, MI (2016)
## 8 Jan SSE 72 0.0939 Detroit, MI (2016)
## 9 Jan S 131 0.171 Detroit, MI (2016)
## 10 Jan SSW 158 0.206 Detroit, MI (2016)
## # ... with 494 more rows
The functions can be combined so that a full process can be run for a given file:
# File name to city name mapper
cityNameMapper <- c(kdtw_2016="Detroit, MI (2016)",
kewr_2016="Newark, NJ (2016)",
kgrb_2016="Green Bay, WI (2016)",
kgrr_2016="Grand Rapids, MI (2016)",
kiah_2016="Houston, TX (2016)",
kind_2016="Indianapolis, IN (2016)",
klas_2015="Las Vegas, NV (2015)",
klas_2016="Las Vegas, NV (2016)",
klas_2017="Las Vegas, NV (2017)",
klnk_2016="Lincoln, NE (2016)",
kmke_2016="Milwaukee, WI (2016)",
kmsn_2016="Madison, WI (2016)",
kmsp_2016="Minneapolis, MN (2016)",
kmsy_2015="New Orleans, LA (2015)",
kmsy_2016="New Orleans, LA (2016)",
kmsy_2017="New Orleans, LA (2017)",
kord_2015="Chicago, IL (2015)",
kord_2016="Chicago, IL (2016)",
kord_2017="Chicago, IL (2017)",
ksan_2015="San Diego, CA (2015)",
ksan_2016="San Diego, CA (2016)",
ksan_2017="San Diego, CA (2017)",
ktvc_2016="Traverse City, MI (2016)"
)
# This is a helper function to create a locale description
getLocaleDescription <- function(x, mapper=cityNameMapper) {
# Initialize the description as NULL
desc <- NULL
for (potMatch in names(mapper)) {
if (str_detect(string=x, pattern=potMatch)) {
desc <- mapper[potMatch]
break
}
}
# If the mapping failed, use UNMAPPED_x as the description
if (is.null(desc)) {
desc <- paste0("UNMAPPED_", x)
cat("\nUnable to find a description, will use ", desc, "\n\n", sep="")
} else {
cat("\nWill use ", desc, " as the description for ", x, "\n\n", sep="")
}
# Return the descriptive name
desc
}
# The following function runs the functions that work on a single data source
combinedEDA <- function(filename=NULL,
tbl=NULL,
desc=NULL,
mets=c("WindDir", "WindSpeed", "TempC", "DewC", "Altimeter",
"modSLP", "cType1", "cLevel1", "month", "day"
),
corPairs=list(c("TempC", "TempF"),
c("TempC", "DewC"),
c("Altimeter", "modSLP"),
c("Altimeter", "WindSpeed")
),
fctPairs=list(c("month", "TempF"),
c("month", "DewF"),
c("month", "WindSpeed"),
c("month", "Altimeter"),
c("wType", "Visibility"),
c("wType", "WindSpeed"),
c("WindDir", "WindSpeed"),
c("WindDir", "TempF")
),
heatVars=c("TempC", "TempF",
"DewC", "DewF",
"Altimeter", "modSLP",
"WindSpeed", "Visibility",
"monthint", "day"
),
lmVars=list(c("modSLP", "Altimeter"),
c("modSLP", "Altimeter + TempF"),
c("TempF", "DewF"),
c("WindSpeed", "Altimeter + TempF + DewF + month")
),
mapVariables=varMapper,
mapFileNames=cityNameMapper,
path="./RInputFiles/ProcessedMETAR/"
) {
# Require that either filename OR tbl be passed
if (is.null(filename) & is.null(tbl)) {
cat("\nMust provide either a filename or an already-loaded tibble\n")
stop("\nfilename=NULL and tbl=NULL may not both be passed to combinedEDA()\n")
}
# Require that either 1) filename and mapFileNames, OR 2) desc be passed
if ((is.null(filename) | is.null(mapFileNames)) & is.null(desc)) {
cat("\nMust provide either a filename with mapFileNames or a file description\n")
stop("\nWhen desc=NULL must have non-null entries for both filename= and mapFileNames=\n")
}
# Find the description if it is NULL (default)
if (is.null(desc)) {
desc <- getLocaleDescription(filename, mapper=mapFileNames)
}
# Warn if both filename and tbl are passed, since tbl will be used
if (!is.null(filename) & !is.null(tbl)) {
cat("\nA tibble has been passed and will be used as the dataset for this function\n")
warning("\nArgument filename=", filename, " is NOT loaded since a tibble was passed\n")
}
# Read in the file unless tbl has already been passed to the routine
if (is.null(tbl)) {
tbl <- readRDS(paste0(path, filename))
}
# Plot counts by metric
plotcountsByMetric(tbl, mets=mets, title=desc, diagnose=TRUE)
# Plot relationships between two variables
for (ys in corPairs) {
plotNumCor(tbl, var1=ys[1], var2=ys[2], subT=desc, diagnose=TRUE)
}
# plot numeric vs. factor
for (ys in fctPairs) {
plotFactorNumeric(tbl, fctVar=ys[1], numVar=ys[2], subT=desc, showXLabel=FALSE, diagnose=TRUE)
}
# Heatmap for variable correlations
corMETAR(tbl, numVars=heatVars, subT=paste0(desc, " METAR"))
# Run linear regressions
for (ys in lmVars) {
lmMETAR(tbl, y=ys[1], x=ys[2], yName=varMapper[ys[1]], subT=desc)
}
# Run the basic wind plots
basicWindPlots(tbl, desc=desc, gran="")
# Return the tibble
tbl
}
The process can then be run for each of Detroit (2016), Las Vegas (2016), and New Orleans (2016):
# Retrieve the Detroit, MI (2016) data
kdtw_2016 <- combinedEDA("metar_kdtw_2016.rds")
##
## Will use Detroit, MI (2016) as the description for metar_kdtw_2016.rds
##
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 651 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## *** Correlations use 8769 complete cases (99.4% of 8818 total) ***
## TempC TempF DewC DewF Altimeter modSLP WindSpeed Visibility
## TempC 1.00 1.00 0.92 0.92 -0.17 -0.24 -0.10 0.14
## TempF 1.00 1.00 0.92 0.92 -0.17 -0.24 -0.10 0.14
## DewC 0.92 0.92 1.00 1.00 -0.22 -0.28 -0.18 -0.07
## DewF 0.92 0.92 1.00 1.00 -0.22 -0.28 -0.18 -0.07
## Altimeter -0.17 -0.17 -0.22 -0.22 1.00 1.00 -0.37 0.15
## modSLP -0.24 -0.24 -0.28 -0.28 1.00 1.00 -0.35 0.14
## WindSpeed -0.10 -0.10 -0.18 -0.18 -0.37 -0.35 1.00 0.10
## Visibility 0.14 0.14 -0.07 -0.07 0.15 0.14 0.10 1.00
## monthint 0.24 0.24 0.32 0.32 0.19 0.17 -0.06 -0.11
## day 0.04 0.04 0.03 0.03 -0.02 -0.02 0.06 0.04
## monthint day
## TempC 0.24 0.04
## TempF 0.24 0.04
## DewC 0.32 0.03
## DewF 0.32 0.03
## Altimeter 0.19 -0.02
## modSLP 0.17 -0.02
## WindSpeed -0.06 0.06
## Visibility -0.11 0.04
## monthint 1.00 0.02
## day 0.02 1.00
##
## *** Regression call is: modSLP ~ Altimeter ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96361 -0.44022 -0.03758 0.40772 1.41448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.44389 0.74723 -22.01 <2e-16 ***
## Altimeter 34.41514 0.02488 1383.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4961 on 8767 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.9954, Adjusted R-squared: 0.9954
## F-statistic: 1.913e+06 on 1 and 8767 DF, p-value: < 2.2e-16
##
## *** Regression call is: modSLP ~ Altimeter + TempF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57802 -0.12674 0.00481 0.12820 0.65708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.713e+00 2.803e-01 -13.25 <2e-16 ***
## Altimeter 3.403e+01 9.302e-03 3658.80 <2e-16 ***
## TempF -2.351e-02 9.941e-05 -236.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1826 on 8766 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9994
## F-statistic: 7.086e+06 on 2 and 8766 DF, p-value: < 2.2e-16
##
## *** Regression call is: TempF ~ DewF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.132 -6.065 -2.083 4.042 31.029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.060663 0.211976 52.18 <2e-16 ***
## DewF 1.000977 0.004677 214.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.987 on 8767 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.8393, Adjusted R-squared: 0.8393
## F-statistic: 4.58e+04 on 1 and 8767 DF, p-value: < 2.2e-16
##
## *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.3544 -2.5454 -0.1494 2.3669 20.6042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 306.827720 6.726512 45.615 < 2e-16 ***
## Altimeter -9.980907 0.222840 -44.790 < 2e-16 ***
## TempF 0.206384 0.006044 34.145 < 2e-16 ***
## DewF -0.217533 0.006755 -32.204 < 2e-16 ***
## monthFeb 0.188758 0.204057 0.925 0.3550
## monthMar -1.375926 0.211244 -6.513 7.75e-11 ***
## monthApr -2.406187 0.217870 -11.044 < 2e-16 ***
## monthMay -3.616016 0.248432 -14.555 < 2e-16 ***
## monthJun -4.337974 0.277532 -15.631 < 2e-16 ***
## monthJul -3.428235 0.300872 -11.394 < 2e-16 ***
## monthAug -2.971582 0.312087 -9.522 < 2e-16 ***
## monthSep -1.839206 0.292614 -6.285 3.43e-10 ***
## monthOct -0.477868 0.255640 -1.869 0.0616 .
## monthNov -0.013008 0.226830 -0.057 0.9543
## monthDec 2.071836 0.200848 10.315 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.883 on 8754 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.3218, Adjusted R-squared: 0.3207
## F-statistic: 296.7 on 14 and 8754 DF, p-value: < 2.2e-16
# Retrieve the Las Vegas, NV (2016) data
klas_2016 <- combinedEDA("metar_klas_2016.rds")
##
## Will use Las Vegas, NV (2016) as the description for metar_klas_2016.rds
##
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 2440 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## *** Correlations use 8783 complete cases (99.6% of 8818 total) ***
## TempC TempF DewC DewF Altimeter modSLP WindSpeed Visibility
## TempC 1.00 1.00 0.22 0.22 -0.51 -0.62 0.22 0.01
## TempF 1.00 1.00 0.22 0.22 -0.51 -0.62 0.22 0.01
## DewC 0.22 0.22 1.00 1.00 -0.24 -0.27 -0.04 -0.13
## DewF 0.22 0.22 1.00 1.00 -0.24 -0.27 -0.04 -0.13
## Altimeter -0.51 -0.51 -0.24 -0.24 1.00 0.99 -0.38 0.06
## modSLP -0.62 -0.62 -0.27 -0.27 0.99 1.00 -0.38 0.06
## WindSpeed 0.22 0.22 -0.04 -0.04 -0.38 -0.38 1.00 -0.02
## Visibility 0.01 0.01 -0.13 -0.13 0.06 0.06 -0.02 1.00
## monthint 0.14 0.14 0.06 0.06 0.06 0.03 -0.01 -0.02
## day -0.01 -0.01 0.03 0.03 -0.01 -0.01 0.00 -0.07
## monthint day
## TempC 0.14 -0.01
## TempF 0.14 -0.01
## DewC 0.06 0.03
## DewF 0.06 0.03
## Altimeter 0.06 -0.01
## modSLP 0.03 -0.01
## WindSpeed -0.01 0.00
## Visibility -0.02 -0.07
## monthint 1.00 0.02
## day 0.02 1.00
##
## *** Regression call is: modSLP ~ Altimeter ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1381 -0.7848 -0.0607 0.6891 3.6362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -116.30984 1.72975 -67.24 <2e-16 ***
## Altimeter 37.68826 0.05776 652.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9753 on 8781 degrees of freedom
## (35 observations deleted due to missingness)
## Multiple R-squared: 0.9798, Adjusted R-squared: 0.9798
## F-statistic: 4.258e+05 on 1 and 8781 DF, p-value: < 2.2e-16
##
## *** Regression call is: modSLP ~ Altimeter + TempF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18161 -0.33113 0.02395 0.33395 1.08509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.537e+01 8.693e-01 -29.18 <2e-16 ***
## Altimeter 3.478e+01 2.868e-02 1212.89 <2e-16 ***
## TempF -5.581e-02 2.813e-04 -198.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4165 on 8780 degrees of freedom
## (35 observations deleted due to missingness)
## Multiple R-squared: 0.9963, Adjusted R-squared: 0.9963
## F-statistic: 1.187e+06 on 2 and 8780 DF, p-value: < 2.2e-16
##
## *** Regression call is: TempF ~ DewF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.026 -14.381 -0.341 13.299 46.312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.12523 0.49251 126.14 <2e-16 ***
## DewF 0.32810 0.01576 20.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.94 on 8781 degrees of freedom
## (35 observations deleted due to missingness)
## Multiple R-squared: 0.04705, Adjusted R-squared: 0.04694
## F-statistic: 433.5 on 1 and 8781 DF, p-value: < 2.2e-16
##
## *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1199 -2.7665 -0.6134 2.0975 22.2630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 316.166903 9.583584 32.990 < 2e-16 ***
## Altimeter -10.363527 0.316290 -32.766 < 2e-16 ***
## TempF 0.012196 0.005239 2.328 0.019933 *
## DewF -0.053040 0.004107 -12.915 < 2e-16 ***
## monthFeb 1.477528 0.225729 6.546 6.26e-11 ***
## monthMar 0.848312 0.232319 3.652 0.000262 ***
## monthApr 1.590035 0.242532 6.556 5.84e-11 ***
## monthMay 0.327608 0.261841 1.251 0.210905
## monthJun 0.141204 0.319290 0.442 0.658325
## monthJul 1.092983 0.328362 3.329 0.000876 ***
## monthAug 0.545515 0.319165 1.709 0.087450 .
## monthSep 0.647423 0.281377 2.301 0.021420 *
## monthOct 1.147640 0.253286 4.531 5.95e-06 ***
## monthNov 1.097005 0.227198 4.828 1.40e-06 ***
## monthDec 0.672432 0.214810 3.130 0.001752 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.181 on 8768 degrees of freedom
## (35 observations deleted due to missingness)
## Multiple R-squared: 0.1774, Adjusted R-squared: 0.176
## F-statistic: 135 on 14 and 8768 DF, p-value: < 2.2e-16
# Retrieve the New Orleans, LA (2016) data
kmsy_2016 <- combinedEDA("metar_kmsy_2016.rds")
##
## Will use New Orleans, LA (2016) as the description for metar_kmsy_2016.rds
##
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 1029 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## *** Correlations use 8799 complete cases (99.8% of 8813 total) ***
## TempC TempF DewC DewF Altimeter modSLP WindSpeed Visibility
## TempC 1.00 1.00 0.85 0.85 -0.48 -0.47 -0.09 0.13
## TempF 1.00 1.00 0.85 0.85 -0.48 -0.48 -0.09 0.13
## DewC 0.85 0.85 1.00 1.00 -0.56 -0.56 -0.21 -0.06
## DewF 0.85 0.85 1.00 1.00 -0.57 -0.56 -0.21 -0.05
## Altimeter -0.48 -0.48 -0.56 -0.57 1.00 1.00 -0.08 0.08
## modSLP -0.47 -0.48 -0.56 -0.56 1.00 1.00 -0.08 0.08
## WindSpeed -0.09 -0.09 -0.21 -0.21 -0.08 -0.08 1.00 0.03
## Visibility 0.13 0.13 -0.06 -0.05 0.08 0.08 0.03 1.00
## monthint 0.26 0.26 0.31 0.31 0.09 0.09 -0.22 -0.02
## day 0.03 0.03 0.07 0.07 0.02 0.02 -0.05 0.01
## monthint day
## TempC 0.26 0.03
## TempF 0.26 0.03
## DewC 0.31 0.07
## DewF 0.31 0.07
## Altimeter 0.09 0.02
## modSLP 0.09 0.02
## WindSpeed -0.22 -0.05
## Visibility -0.02 0.01
## monthint 1.00 0.02
## day 0.02 1.00
##
## *** Regression call is: modSLP ~ Altimeter ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.232515 -0.083700 0.000544 0.084787 0.227320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.773456 0.230067 12.05 <2e-16 ***
## Altimeter 33.779607 0.007654 4413.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1015 on 8797 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9995
## F-statistic: 1.948e+07 on 1 and 8797 DF, p-value: < 2.2e-16
##
## *** Regression call is: modSLP ~ Altimeter + TempF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.222066 -0.082359 -0.001036 0.083533 0.217460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.557e+00 2.638e-01 5.903 3.7e-09 ***
## Altimeter 3.382e+01 8.667e-03 3902.029 < 2e-16 ***
## TempF 8.751e-04 9.432e-05 9.277 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.101 on 8796 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.9996, Adjusted R-squared: 0.9996
## F-statistic: 9.833e+06 on 2 and 8796 DF, p-value: < 2.2e-16
##
## *** Regression call is: TempF ~ DewF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.051 -4.777 -1.265 4.199 26.477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.440028 0.319858 76.41 <2e-16 ***
## DewF 0.778877 0.005065 153.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.763 on 8797 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.7288, Adjusted R-squared: 0.7288
## F-statistic: 2.364e+04 on 1 and 8797 DF, p-value: < 2.2e-16
##
## *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9452 -2.5710 -0.2247 2.2592 18.9024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 190.986029 11.812622 16.168 < 2e-16 ***
## Altimeter -6.223347 0.387357 -16.066 < 2e-16 ***
## TempF 0.225378 0.007290 30.915 < 2e-16 ***
## DewF -0.169883 0.006513 -26.082 < 2e-16 ***
## monthFeb -0.787460 0.204817 -3.845 0.000122 ***
## monthMar -0.270554 0.215758 -1.254 0.209885
## monthApr -1.941174 0.225642 -8.603 < 2e-16 ***
## monthMay -3.763407 0.241926 -15.556 < 2e-16 ***
## monthJun -5.215285 0.269852 -19.326 < 2e-16 ***
## monthJul -5.568010 0.283515 -19.639 < 2e-16 ***
## monthAug -4.834769 0.277122 -17.446 < 2e-16 ***
## monthSep -5.769099 0.272789 -21.149 < 2e-16 ***
## monthOct -4.812814 0.245841 -19.577 < 2e-16 ***
## monthNov -2.827680 0.220138 -12.845 < 2e-16 ***
## monthDec -0.203742 0.210552 -0.968 0.333241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.847 on 8784 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.2278, Adjusted R-squared: 0.2266
## F-statistic: 185.1 on 14 and 8784 DF, p-value: < 2.2e-16
There is a lot of good information in the EDA, but having everything in one place is overwhelming. A wrapper function can instead direct the outputs to a specified file:
wrapCombinedEDA <- function(readFile,
readPath="./RInputFiles/ProcessedMETAR/",
mapFileNames=cityNameMapper,
desc=NULL,
writeLogFile=NULL,
writeLogPDF=NULL,
writeLogPath=NULL,
appendWriteFile=FALSE,
...
) {
# Read in the requested file
tbl <- readRDS(paste0(readPath, readFile))
# Find the description if it has not been passed
if (is.null(desc)) {
desc <- getLocaleDescription(readFile, mapper=mapFileNames)
}
# Helper function that only runs the combinedEDA() routine
coreFunc <- function() { combinedEDA(tbl=tbl, desc=desc, mapFileNames=mapFileNames, ...) }
# If writeLogPDF is not NULL, direct the graphs to a suitable PDF
if (!is.null(writeLogPDF)) {
# Prepend the log path if one was provided
if (!is.null(writeLogPath)) {
writeLogPDF <- paste0(writeLogPath, writeLogPDF)
}
# Provide the location of the EDA pdf file
cat("\nEDA PDF file is available at:", writeLogPDF, "\n")
# Redirect the writing to writeLogPDF
pdf(writeLogPDF)
}
# Run EDA on the tbl using capture.output to redirect to a log file if specified
if (!is.null(writeLogFile)) {
# Prepend the log path if one was provided
if (!is.null(writeLogPath)) {
writeLogFile <- paste0(writeLogPath, writeLogFile)
}
# Provide the location of the EDA log file
cat("\nEDA log file is available at:", writeLogFile, "\n")
# Run EDA such that the output goes to the log file
capture.output(coreFunc(),
file=writeLogFile,
append=appendWriteFile
)
} else {
# Run EDA such that output stays in stdout
coreFunc()
}
# If a PDF device was opened, close it so graphics return to the default device
if (!is.null(writeLogPDF)) {
dev.off()
}
# Return the tbl
tbl
}
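One caveat with the wrapper above: if combinedEDA() throws an error, the device opened by pdf() is never closed. A more defensive pattern (a sketch only; withPDFDevice is a hypothetical helper, not part of this module) registers the cleanup with on.exit() so the device closes even on error:

```r
# Sketch: run a plotting expression with a PDF device that is guaranteed to
# close, even if `expr` errors partway through.
withPDFDevice <- function(pdfFile, expr) {
  pdf(pdfFile)         # open the graphics device
  on.exit(dev.off())   # guarantee the device closes on exit or error
  force(expr)          # evaluate the plotting expression while the device is active
}

# Hypothetical usage: plots land in eda.pdf even if the body errors
# withPDFDevice("eda.pdf", { plot(1:10); stop("oops") })
```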
filePath <- "./RInputFiles/ProcessedMETAR/"
# Example for the basic function for Chicago, IL (2016) written to stdout
kord_2016 <- wrapCombinedEDA("metar_kord_2016.rds", readPath=filePath)
##
## Will use Chicago, IL (2016) as the description for metar_kord_2016.rds
##
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 823 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## *** Correlations use 8805 complete cases (99.9% of 8815 total) ***
## TempC TempF DewC DewF Altimeter modSLP WindSpeed Visibility
## TempC 1.00 1.00 0.93 0.93 -0.22 -0.29 -0.12 0.19
## TempF 1.00 1.00 0.93 0.93 -0.22 -0.29 -0.12 0.19
## DewC 0.93 0.93 1.00 1.00 -0.27 -0.34 -0.19 0.05
## DewF 0.93 0.93 1.00 1.00 -0.28 -0.34 -0.19 0.05
## Altimeter -0.22 -0.22 -0.27 -0.28 1.00 1.00 -0.31 0.19
## modSLP -0.29 -0.29 -0.34 -0.34 1.00 1.00 -0.29 0.17
## WindSpeed -0.12 -0.12 -0.19 -0.19 -0.31 -0.29 1.00 0.01
## Visibility 0.19 0.19 0.05 0.05 0.19 0.17 0.01 1.00
## monthint 0.24 0.24 0.27 0.27 0.18 0.16 -0.09 0.02
## day 0.05 0.05 0.05 0.05 -0.06 -0.06 0.08 0.04
## monthint day
## TempC 0.24 0.05
## TempF 0.24 0.05
## DewC 0.27 0.05
## DewF 0.27 0.05
## Altimeter 0.18 -0.06
## modSLP 0.16 -0.06
## WindSpeed -0.09 0.08
## Visibility 0.02 0.04
## monthint 1.00 0.02
## day 0.02 1.00
##
## *** Regression call is: modSLP ~ Altimeter ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.31041 -0.47561 -0.07852 0.43888 1.66927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.11596 0.85978 -26.89 <2e-16 ***
## Altimeter 34.63779 0.02864 1209.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5573 on 8803 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.994, Adjusted R-squared: 0.994
## F-statistic: 1.463e+06 on 1 and 8803 DF, p-value: < 2.2e-16
##
## *** Regression call is: modSLP ~ Altimeter + TempF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69460 -0.12269 0.00055 0.12142 0.73628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.603e+00 2.864e-01 -16.07 <2e-16 ***
## Altimeter 3.407e+01 9.502e-03 3585.30 <2e-16 ***
## TempF -2.606e-02 9.503e-05 -274.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1804 on 8802 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9994
## F-statistic: 7.018e+06 on 2 and 8802 DF, p-value: < 2.2e-16
##
## *** Regression call is: TempF ~ DewF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.638 -5.485 -1.570 3.601 35.456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.317877 0.188557 54.72 <2e-16 ***
## DewF 1.004504 0.004095 245.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.407 on 8803 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.8724, Adjusted R-squared: 0.8724
## F-statistic: 6.019e+04 on 1 and 8803 DF, p-value: < 2.2e-16
##
## *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.295 -2.635 -0.157 2.494 21.356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 291.341063 7.156664 40.709 < 2e-16 ***
## Altimeter -9.415029 0.237118 -39.706 < 2e-16 ***
## TempF 0.160498 0.006445 24.904 < 2e-16 ***
## DewF -0.205407 0.006912 -29.718 < 2e-16 ***
## monthFeb 0.649436 0.210063 3.092 0.001997 **
## monthMar 0.033603 0.220448 0.152 0.878850
## monthApr 0.092196 0.230608 0.400 0.689317
## monthMay -0.859361 0.258300 -3.327 0.000882 ***
## monthJun -1.564544 0.296635 -5.274 1.36e-07 ***
## monthJul -1.191767 0.314437 -3.790 0.000152 ***
## monthAug -1.319054 0.319647 -4.127 3.72e-05 ***
## monthSep 0.394263 0.299972 1.314 0.188768
## monthOct 0.511074 0.264003 1.936 0.052916 .
## monthNov 0.050679 0.235873 0.215 0.829882
## monthDec 1.741405 0.205243 8.485 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.984 on 8790 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.2494, Adjusted R-squared: 0.2482
## F-statistic: 208.6 on 14 and 8790 DF, p-value: < 2.2e-16
# Example for the basic function for San Diego, CA (2016) written to 'metar_ksan_2016_EDA.log'
ksan_2016 <- wrapCombinedEDA("metar_ksan_2016.rds",
readPath=filePath,
writeLogFile='metar_ksan_2016_EDA.log',
writeLogPDF='metar_ksan_2016_EDA.pdf',
writeLogPath=filePath
)
##
## Will use San Diego, CA (2016) as the description for metar_ksan_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.log
Finally, a convenience function can build the inputs to wrapCombinedEDA(), reducing repetitive typing:
logAndPDFCombinedEDA <- function(tblName, filePath="./RInputFiles/ProcessedMETAR/") {
# Create the RDS file name
rdsName <- paste0("metar_", tblName, ".rds")
cat("\nRDS Name:", rdsName)
# Create the log file name
logName <- paste0("metar_", tblName, "_EDA.log")
cat("\nLog Name:", logName)
# Create the PDF file name
pdfName <- paste0("metar_", tblName, "_EDA.pdf")
cat("\nPDF Name:", pdfName)
# Call wrapCombinedEDA()
tbl <- wrapCombinedEDA(rdsName,
readPath=filePath,
writeLogFile=logName,
writeLogPDF=pdfName,
writeLogPath=filePath
)
# Return the tbl
tbl
}
This function can then be run for all of the relevant files (cached to avoid multiple runs):
# Run for 2016 only for kdtw, kewr, kgrb, kgrr, kiah, kind, klnk, kmke, kmsn, kmsp, ktvc
kdtw_2016 <- logAndPDFCombinedEDA("kdtw_2016")
##
## RDS Name: metar_kdtw_2016.rds
## Log Name: metar_kdtw_2016_EDA.log
## PDF Name: metar_kdtw_2016_EDA.pdf
## Will use Detroit, MI (2016) as the description for metar_kdtw_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kdtw_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kdtw_2016_EDA.log
kewr_2016 <- logAndPDFCombinedEDA("kewr_2016")
##
## RDS Name: metar_kewr_2016.rds
## Log Name: metar_kewr_2016_EDA.log
## PDF Name: metar_kewr_2016_EDA.pdf
## Will use Newark, NJ (2016) as the description for metar_kewr_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kewr_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kewr_2016_EDA.log
kgrb_2016 <- logAndPDFCombinedEDA("kgrb_2016")
##
## RDS Name: metar_kgrb_2016.rds
## Log Name: metar_kgrb_2016_EDA.log
## PDF Name: metar_kgrb_2016_EDA.pdf
## Will use Green Bay, WI (2016) as the description for metar_kgrb_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrb_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrb_2016_EDA.log
kgrr_2016 <- logAndPDFCombinedEDA("kgrr_2016")
##
## RDS Name: metar_kgrr_2016.rds
## Log Name: metar_kgrr_2016_EDA.log
## PDF Name: metar_kgrr_2016_EDA.pdf
## Will use Grand Rapids, MI (2016) as the description for metar_kgrr_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrr_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrr_2016_EDA.log
kiah_2016 <- logAndPDFCombinedEDA("kiah_2016")
##
## RDS Name: metar_kiah_2016.rds
## Log Name: metar_kiah_2016_EDA.log
## PDF Name: metar_kiah_2016_EDA.pdf
## Will use Houston, TX (2016) as the description for metar_kiah_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kiah_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kiah_2016_EDA.log
kind_2016 <- logAndPDFCombinedEDA("kind_2016")
##
## RDS Name: metar_kind_2016.rds
## Log Name: metar_kind_2016_EDA.log
## PDF Name: metar_kind_2016_EDA.pdf
## Will use Indianapolis, IN (2016) as the description for metar_kind_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kind_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kind_2016_EDA.log
klnk_2016 <- logAndPDFCombinedEDA("klnk_2016")
##
## RDS Name: metar_klnk_2016.rds
## Log Name: metar_klnk_2016_EDA.log
## PDF Name: metar_klnk_2016_EDA.pdf
## Will use Lincoln, NE (2016) as the description for metar_klnk_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klnk_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klnk_2016_EDA.log
kmke_2016 <- logAndPDFCombinedEDA("kmke_2016")
##
## RDS Name: metar_kmke_2016.rds
## Log Name: metar_kmke_2016_EDA.log
## PDF Name: metar_kmke_2016_EDA.pdf
## Will use Milwaukee, WI (2016) as the description for metar_kmke_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmke_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmke_2016_EDA.log
kmsn_2016 <- logAndPDFCombinedEDA("kmsn_2016")
##
## RDS Name: metar_kmsn_2016.rds
## Log Name: metar_kmsn_2016_EDA.log
## PDF Name: metar_kmsn_2016_EDA.pdf
## Will use Madison, WI (2016) as the description for metar_kmsn_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsn_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsn_2016_EDA.log
kmsp_2016 <- logAndPDFCombinedEDA("kmsp_2016")
##
## RDS Name: metar_kmsp_2016.rds
## Log Name: metar_kmsp_2016_EDA.log
## PDF Name: metar_kmsp_2016_EDA.pdf
## Will use Minneapolis, MN (2016) as the description for metar_kmsp_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsp_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsp_2016_EDA.log
ktvc_2016 <- logAndPDFCombinedEDA("ktvc_2016")
##
## RDS Name: metar_ktvc_2016.rds
## Log Name: metar_ktvc_2016_EDA.log
## PDF Name: metar_ktvc_2016_EDA.pdf
## Will use Traverse City, MI (2016) as the description for metar_ktvc_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.log
# Run for 2015-2016-2017 for klas, kmsy, kord, ksan
klas_2015 <- logAndPDFCombinedEDA("klas_2015")
##
## RDS Name: metar_klas_2015.rds
## Log Name: metar_klas_2015_EDA.log
## PDF Name: metar_klas_2015_EDA.pdf
## Will use Las Vegas, NV (2015) as the description for metar_klas_2015.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2015_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2015_EDA.log
klas_2016 <- logAndPDFCombinedEDA("klas_2016")
##
## RDS Name: metar_klas_2016.rds
## Log Name: metar_klas_2016_EDA.log
## PDF Name: metar_klas_2016_EDA.pdf
## Will use Las Vegas, NV (2016) as the description for metar_klas_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2016_EDA.log
klas_2017 <- logAndPDFCombinedEDA("klas_2017")
##
## RDS Name: metar_klas_2017.rds
## Log Name: metar_klas_2017_EDA.log
## PDF Name: metar_klas_2017_EDA.pdf
## Will use Las Vegas, NV (2017) as the description for metar_klas_2017.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2017_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2017_EDA.log
kmsy_2015 <- logAndPDFCombinedEDA("kmsy_2015")
##
## RDS Name: metar_kmsy_2015.rds
## Log Name: metar_kmsy_2015_EDA.log
## PDF Name: metar_kmsy_2015_EDA.pdf
## Will use New Orleans, LA (2015) as the description for metar_kmsy_2015.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2015_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2015_EDA.log
kmsy_2016 <- logAndPDFCombinedEDA("kmsy_2016")
##
## RDS Name: metar_kmsy_2016.rds
## Log Name: metar_kmsy_2016_EDA.log
## PDF Name: metar_kmsy_2016_EDA.pdf
## Will use New Orleans, LA (2016) as the description for metar_kmsy_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2016_EDA.log
kmsy_2017 <- logAndPDFCombinedEDA("kmsy_2017")
##
## RDS Name: metar_kmsy_2017.rds
## Log Name: metar_kmsy_2017_EDA.log
## PDF Name: metar_kmsy_2017_EDA.pdf
## Will use New Orleans, LA (2017) as the description for metar_kmsy_2017.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2017_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2017_EDA.log
kord_2015 <- logAndPDFCombinedEDA("kord_2015")
##
## RDS Name: metar_kord_2015.rds
## Log Name: metar_kord_2015_EDA.log
## PDF Name: metar_kord_2015_EDA.pdf
## Will use Chicago, IL (2015) as the description for metar_kord_2015.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2015_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2015_EDA.log
kord_2016 <- logAndPDFCombinedEDA("kord_2016")
##
## RDS Name: metar_kord_2016.rds
## Log Name: metar_kord_2016_EDA.log
## PDF Name: metar_kord_2016_EDA.pdf
## Will use Chicago, IL (2016) as the description for metar_kord_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2016_EDA.log
kord_2017 <- logAndPDFCombinedEDA("kord_2017")
##
## RDS Name: metar_kord_2017.rds
## Log Name: metar_kord_2017_EDA.log
## PDF Name: metar_kord_2017_EDA.pdf
## Will use Chicago, IL (2017) as the description for metar_kord_2017.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2017_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2017_EDA.log
ksan_2015 <- logAndPDFCombinedEDA("ksan_2015")
##
## RDS Name: metar_ksan_2015.rds
## Log Name: metar_ksan_2015_EDA.log
## PDF Name: metar_ksan_2015_EDA.pdf
## Will use San Diego, CA (2015) as the description for metar_ksan_2015.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2015_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2015_EDA.log
ksan_2016 <- logAndPDFCombinedEDA("ksan_2016")
##
## RDS Name: metar_ksan_2016.rds
## Log Name: metar_ksan_2016_EDA.log
## PDF Name: metar_ksan_2016_EDA.pdf
## Will use San Diego, CA (2016) as the description for metar_ksan_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.log
ksan_2017 <- logAndPDFCombinedEDA("ksan_2017")
##
## RDS Name: metar_ksan_2017.rds
## Log Name: metar_ksan_2017_EDA.log
## PDF Name: metar_ksan_2017_EDA.pdf
## Will use San Diego, CA (2017) as the description for metar_ksan_2017.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2017_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2017_EDA.log
Reload Traverse City, MI (2016) due to a previous error (short-term fix to avoid re-running a full cache):
ktvc_2016 <- logAndPDFCombinedEDA("ktvc_2016")
##
## RDS Name: metar_ktvc_2016.rds
## Log Name: metar_ktvc_2016_EDA.log
## PDF Name: metar_ktvc_2016_EDA.pdf
## Will use Traverse City, MI (2016) as the description for metar_ktvc_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.log
The files are now available for further analysis, with individual PDF files for plots associated with each locale.
With EDA about each locale saved to .pdf and .log files, it is interesting to investigate comparisons among the various locales. For example, how do the temperature patterns by month for a given locale compare to the overall global median temperature patterns by month?
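The monthly comparison in question can be sketched with dplyr (a sketch only, assuming a combined table such as the all2016Data object built below, with month, TempF, and sourceName columns; monthlyCompare is a hypothetical name):

```r
library(dplyr)

# Sketch: compare each locale's monthly median temperature to the global
# (all-locale) monthly median.
monthlyCompare <- function(tbl) {
  # Global median temperature by month across all locales
  globalMed <- tbl %>%
    group_by(month) %>%
    summarize(globalMedTempF=median(TempF, na.rm=TRUE)) %>%
    ungroup()
  # Locale-level monthly medians, joined back to the global medians
  tbl %>%
    group_by(sourceName, month) %>%
    summarize(localeMedTempF=median(TempF, na.rm=TRUE)) %>%
    ungroup() %>%
    left_join(globalMed, by="month") %>%
    mutate(diffFromGlobal=localeMedTempF - globalMedTempF)
}
```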
The existing functions contain most of the code needed to perform this; only a few additional steps are required.
The first step is to combine one or more processed files, with a column added for locale:
# Combine files from a character list
combineProcessedFiles <- function(charList, mapper=cityNameMapper) {
# Combine the objects represented by charList, and name the list items using charList
listFiles <- lapply(charList, FUN=function(x) { get(x) })
names(listFiles) <- charList
# Bind rows, and add the descriptive locale name as sourceName
tblFiles <- bind_rows(listFiles, .id="source") %>%
mutate(sourceName=mapper[source])
tblFiles
}
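The bind_rows(.id=) pattern used above can be illustrated with toy data (not run on the real METAR files; toyMapper and toyList are invented stand-ins for cityNameMapper and the cached tables):

```r
library(dplyr)

# Each list element becomes a block of rows, with the list name recorded in
# the `source` column; a named vector then maps the code to a readable name.
toyMapper <- c(kaaa_2016="Anytown, US (2016)", kbbb_2016="Othertown, US (2016)")
toyList <- list(kaaa_2016=tibble(TempF=c(60, 62)),
                kbbb_2016=tibble(TempF=c(70, 71)))
bind_rows(toyList, .id="source") %>%
  mutate(sourceName=toyMapper[source])
```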
The 2016 data will be used to run the combined process:
# Grab all the data that ends in _2016
locales2016 <- ls() %>%
grep(pattern="_2016", value=TRUE)
cat("\nLocales used will be:\n\n", paste0(locales2016, collapse="\n"), "\n\n", sep="")
##
## Locales used will be:
##
## kdtw_2016
## kewr_2016
## kgrb_2016
## kgrr_2016
## kiah_2016
## kind_2016
## klas_2016
## klnk_2016
## kmke_2016
## kmsn_2016
## kmsp_2016
## kmsy_2016
## kord_2016
## ksan_2016
## ktvc_2016
# Combine the 2016 data
all2016Data <- combineProcessedFiles(locales2016)
# Show counts by sourceName
all2016Data %>%
count(source, sourceName)
## # A tibble: 15 x 3
## source sourceName n
## <chr> <chr> <int>
## 1 kdtw_2016 Detroit, MI (2016) 8818
## 2 kewr_2016 Newark, NJ (2016) 8821
## 3 kgrb_2016 Green Bay, WI (2016) 8803
## 4 kgrr_2016 Grand Rapids, MI (2016) 8812
## 5 kiah_2016 Houston, TX (2016) 8816
## 6 kind_2016 Indianapolis, IN (2016) 8767
## 7 klas_2016 Las Vegas, NV (2016) 8818
## 8 klnk_2016 Lincoln, NE (2016) 8813
## 9 kmke_2016 Milwaukee, WI (2016) 8808
## 10 kmsn_2016 Madison, WI (2016) 8798
## 11 kmsp_2016 Minneapolis, MN (2016) 8817
## 12 kmsy_2016 New Orleans, LA (2016) 8813
## 13 kord_2016 Chicago, IL (2016) 8815
## 14 ksan_2016 San Diego, CA (2016) 8810
## 15 ktvc_2016 Traverse City, MI (2016) 8814
Several global summaries will be useful for these comparisons, and a helper function can be created for each. The first, helperCountsByMetric(), computes overall counts and percentages for a given metric:
# Helper function to get overall percentage by metric for variable x
helperCountsByMetric <- function(tbl, ctVar, sumOn="dummyVar") {
tbl %>%
mutate(dummyVar=1) %>%
select_at(vars(all_of(c(ctVar, sumOn)))) %>%
filter_all(all_vars(!is.na(.))) %>%
group_by_at(ctVar) %>%
summarize(n=sum(get(sumOn))) %>%
mutate(nPct=n/sum(n))
}
# Example run to get counts by greatest sky obscuration
helperCountsByMetric(all2016Data, ctVar="wType")
## # A tibble: 7 x 3
## wType n nPct
## <fct> <dbl> <dbl>
## 1 VV 761 0.00576
## 2 OVC 39480 0.299
## 3 BKN 31352 0.237
## 4 SCT 16212 0.123
## 5 FEW 19304 0.146
## 6 CLR 24433 0.185
## 7 Error 601 0.00455
# Helper function to get a geom_smooth for variable y vs variable x
helperNumCor <- function(tbl,
xVar,
yVar,
sumOn="dummyVar",
se=TRUE,
color="red",
method="lm",
lty=2
) {
# Generate the overall totals for sumOn by xVar and yVar
plotData <- tbl %>%
mutate(dummyVar=1) %>%
select_at(vars(all_of(c(xVar, yVar, sumOn)))) %>%
filter_all(all_vars(!is.na(.))) %>%
group_by_at(vars(all_of(c(xVar, yVar)))) %>%
summarize(nTotal=sum(get(sumOn)))
geom_smooth(data=plotData,
aes_string(x=xVar, y=yVar, weight="nTotal"),
se=se,
color=color,
method=method,
lty=lty
)
}
# Example run to get TempC vs DewC
helperNumCor(all2016Data, xVar="TempC", yVar="DewC")
## mapping: weight = ~nTotal, x = ~TempC, y = ~DewC
## geom_smooth: na.rm = FALSE, se = TRUE
## stat_smooth: na.rm = FALSE, se = TRUE, method = lm, formula = y ~ x
## position_identity
# Example for using the helper function on a plot
plotNumCor(kdtw_2016, var1="TempC", var2="DewC") +
helperNumCor(all2016Data, xVar="TempC", yVar="DewC")
# Helper function to calculate .f(numVar) by byVar
helperFactorNumeric <- function(tbl, .f, byVar, numVar, ...) {
tbl %>%
select_at(vars(all_of(c(byVar, numVar)))) %>%
filter_all(all_vars(!is.na(.))) %>%
group_by_at(byVar) %>%
summarize(helpFN=.f(get(numVar), ...))
}
# Example for getting median TempF by month
helperFactorNumeric(all2016Data, .f=median, byVar="month", numVar="TempF")
## # A tibble: 12 x 2
## month helpFN
## <fct> <dbl>
## 1 Jan 30.9
## 2 Feb 35.1
## 3 Mar 46.9
## 4 Apr 52.0
## 5 May 64.0
## 6 Jun 73.0
## 7 Jul 77
## 8 Aug 75.9
## 9 Sep 71.1
## 10 Oct 61.0
## 11 Nov 50
## 12 Dec 34.0
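The `...` passthrough in helperFactorNumeric() is what later allows extra arguments (such as probs for quantile()) to reach .f. A base-R sketch of the same grouped-summary idea on toy data:

```r
dat <- data.frame(month = rep(c("Jan", "Jul"), each = 3),
                  TempF = c(28, 30, 35, 70, 75, 80))
# tapply forwards extra arguments (here probs) to the summary function,
# mirroring the ... passthrough in helperFactorNumeric()
res <- tapply(dat$TempF, dat$month, quantile, probs = 0.5)
res
```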
The function plotCountsByMetric() has been updated above to allow for facetting and plotting of the overall central tendency. Two examples are shown - the base from previous, and a facetted example:
# Previous Example for Detroit 2016 - using WindDir, cType1, month, wType
plotCountsByMetric(kdtw_2016,
mets=c("WindDir", "cType1", "month", "wType"),
title="Detroit, MI (2016)",
dropNA=TRUE,
diagnose=TRUE
)
##
## Dropping 1 rows with 49 observations due to NA
# Facetted example for kdtw_2016, kord_2016, klas_2016, ksan_2016
useData <- all2016Data %>%
filter(source %in% c("kdtw_2016", "klas_2016", "kord_2016", "ksan_2016"))
plotCountsByMetric(useData,
mets=c("WindDir", "cType1", "month", "wType"),
title="Comparison Across Locales (red dots are the median)",
dropNA=TRUE,
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE
)
##
## Dropping 4 rows with 117 observations due to NA
As observed previously, Las Vegas tends towards southerly winds while San Diego tends towards northwesterly winds and calm (direction 000) winds. Detroit is most likely to be overcast, while Las Vegas is most likely to be clear.
The function plotNumCor() has been updated above to allow for facetting and plotting of the overall central tendency. Two examples are shown - the base from previous, and a facetted example:
# Example for Detroit 2016 - using TempC and DewC
plotNumCor(kdtw_2016, var1="TempC", var2="DewC", subT="Detroit, MI (2016)", diagnose=TRUE)
##
## Dropping 1 rows with 49 observations due to NA
# Facetted example for kdtw_2016, kord_2016, klas_2016, ksan_2016
useData <- all2016Data %>%
filter(source %in% c("kdtw_2016", "klas_2016", "kord_2016", "ksan_2016"))
# Facetted plot for very highly correlated variables TempC and TempF
plotNumCor(useData,
var1="TempC",
var2="TempF",
subT="Comparison Across Locales (red dashed lines are the overall)",
dropNA=TRUE,
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE
)
##
## Dropping 4 rows with 117 observations due to NA
# Facetted plot for highly correlated variables TempF and DewF
plotNumCor(useData,
var1="TempF",
var2="DewF",
subT="Comparison Across Locales (red dashed lines are the overall)",
dropNA=TRUE,
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE
)
##
## Dropping 4 rows with 117 observations due to NA
The differences in the relationships between temperature and dew point stand out across the four locales. In contrast, of course, temperatures measured in C and F follow the same pattern regardless of city.
The function plotFactorNumeric() has been updated above to allow for facetting and plotting of the overall central tendency. Two examples are shown - the base from previous, and a facetted example:
# Example for Detroit 2016 - using TempF and month
plotFactorNumeric(kdtw_2016,
fctVar="month",
numVar="TempF",
subT="Detroit, MI (2016)",
showXLabel=FALSE,
diagnose=TRUE
)
##
## Removing 49 records due to NA
# Facetted example for kdtw_2016, kord_2016, klas_2016, ksan_2016
useData <- all2016Data %>%
filter(source %in% c("kdtw_2016", "klas_2016", "kord_2016", "ksan_2016"))
plotFactorNumeric(useData,
fctVar="month",
numVar="TempF",
subT="Overall median shown as red dot",
showXLabel=FALSE,
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE
)
##
## Removing 117 records due to NA
Las Vegas consistently runs above the overall median temperature, while Chicago and Detroit run below the overall median temperature, particularly during the cold season. San Diego has little temperature variation by month and is thus below the median in the warm season and above the median in the cold season.
It is now possible to re-run the EDA plotting routines, focused on the 2016 data:
# Create rounded versions of TempF, DewF, WindSpeed, and Altimeter in all2016Data
all2016Data <- all2016Data %>%
mutate(TempF5=5*round(round(TempF)/5),
DewF5=5*round(round(DewF)/5),
WindSpeed5=5*round(WindSpeed/5),
Altimeter10=round(Altimeter, 1)
)
# Counts by Metric for all 2016 data
plotCountsByMetric(all2016Data,
mets=c("month", "year",
"WindDir", "WindSpeed5",
"Visibility", "Altimeter10",
"TempF5", "DewF5",
"wType"
),
title="Comparisons Across Locales (red dots are the median)",
facetOn="sourceName",
showCentral=TRUE
)
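The "nearest 5" binning used above for TempF5, DewF5, and WindSpeed5 follows a simple divide-round-multiply pattern:

```r
# Round to the nearest multiple of 5 (as in 5 * round(x / 5) above)
round5 <- function(x) 5 * round(x / 5)
round5(c(3, 7.6, 12.6))  # 5 10 15
```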
The cross-locale comparisons bring out a few salient features:
DATA VOLUMES:
* All locales have roughly the same amount of data by year and month, focused on 2016 and with 1-2 days on either side
WIND DIRECTION and WIND SPEED:
* Las Vegas has an excess of no/variable wind and of southerly winds, both appropriate for a desert
* Houston has an excess of no wind and southerly winds, both appropriate for the Gulf Coast
* San Diego has an excess of no wind and of northwesterly wind, both appropriate for the Pacific coast
* Chicago, Grand Rapids, Indianapolis, Minneapolis, and Newark all have lower occurrences of no wind, appropriate for relatively cold mid-latitude cities
* Lincoln looks “about normal”, with the exception that it has slightly more southerly winds; Lincoln is the only Great Plains locale in the analysis, and it appears to be a blend of the Gulf Coast and wintry mid-latitude patterns
* Detroit, Green Bay, Milwaukee, and New Orleans all look “about average”; this is not surprising in the first three cases given the predominance of cold, mid-latitude locales, but is surprising for New Orleans
* Madison and Traverse City are surprising in that both show a predominance of no/variable winds; this is uncommon for cold-weather cities in the mid-latitudes and merits further examination
VISIBILITY:
* The overwhelming majority of visibilities are 10SM (the highest value recorded; in a METAR this effectively means unlimited visibility)
* There is a data issue with a Visibility > 10 that should be addressed
* Las Vegas is slightly more likely than most to have unlimited Visibility and Detroit is slightly less likely than most to have unlimited visibility
ALTIMETER:
* Las Vegas skews low as appropriate for a high-altitude desert locale
* New Orleans and Houston show less variance, perhaps driven by being roughly at sea level and in close proximity to the Gulf of Mexico
* San Diego shows very low variance, perhaps driven by being roughly at sea level and in close proximity to the Pacific Ocean
TEMPERATURE:
* Houston, Las Vegas, and New Orleans skew warm as expected
* San Diego has very low variance as expected
* At a gross level, the other cities look similar to the median, likely driven by the predominance of cold, mid-latitude locales in the data file
DEW POINT:
* Houston and New Orleans skew very high as expected
* Las Vegas skews very low as expected
* San Diego has very low variance as expected
SKY OBSCURATION:
* Lincoln and Green Bay are the most likely to be CLR (clear, no clouds on the automated sensor). This may be driven by a difference in maximum sensor heights, and is unexpected in Green Bay which should be frequently cloudy due to its latitude and proximity to a large body of water
* Detroit, Traverse City, Grand Rapids, and Minneapolis are especially likely to be overcast
* Las Vegas is especially likely to have few clouds or to be clear
Comparisons are run for a few of the numerical correlations:
# Example for 2016 - using mixes of WindSpeed, Altimeter, TempF, DewF, TempC, DewC
numCorList <- list(c("TempC", "TempF"),
c("DewC", "DewF"),
c("TempF", "DewF"),
c("Altimeter", "WindSpeed"),
c("Altimeter", "TempF")
)
# Run the list through plotNumCor()
for (x in numCorList) {
plotNumCor(all2016Data,
var1=x[1],
var2=x[2],
subT="Red dashed line is the overall slope",
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE
)
}
##
## Dropping 15 rows with 593 observations due to NA
##
## Dropping 15 rows with 593 observations due to NA
##
## Dropping 15 rows with 593 observations due to NA
##
## Dropping 15 rows with 593 observations due to NA
##
## Dropping 15 rows with 593 observations due to NA
The cross-locale comparisons bring out a few salient features:
FAHRENHEIT AND CELSIUS:
* As expected, TempF/C are perfectly correlated and DewF/C are perfectly correlated. Since the observations were taken in the US, the TempF/DewF data will be used (TempC/DewC are conversions from the measured TempF/DewF to match the international standard for METAR reporting)
TEMPERATURE AND DEW POINT:
* While many cities have different clusters of temperature/dew point, all but Las Vegas and San Diego follow a pattern where the temperature and the dew point tend to run together at a similar rate
* In Las Vegas, the dew point is largely independent of the temperature
* In San Diego, there is less variance in temperature and dew point, and a lower (but still obvious) tendency for temperature and dew point to rise/fall together
ALTIMETER AND WIND SPEED:
* As expected, when the altimeter rises, on average, the wind speed falls
* This tendency is less pronounced in Houston and New Orleans, and more pronounced in Las Vegas, San Diego, and Indianapolis
ALTIMETER AND TEMPERATURE:
* Overall, low temperatures and high altimeters tend to be observed together
* This is especially so in Houston, Las Vegas, and New Orleans
* This is more modest in Grand Rapids, Green Bay, and Traverse City
Comparisons are then run for numeric variables against factor variables:
# Modify windDir so that it is just N, NE, E, SE, S, SW, W, NW, 000, Variable
mod2016Data <- all2016Data %>%
mutate(tempDir=ifelse(is.na(WindDir) | WindDir %in% c("000", "VRB"), -1, as.numeric(WindDir)),
predomDir=factor(case_when(is.na(WindDir) ~ "Error",
WindDir=="000" ~ "000",
WindDir=="VRB" ~ "VRB",
tempDir >= 337.5 ~ "N",
tempDir <= 22.5 ~ "N",
tempDir <= 67.5 ~ "NE",
tempDir <= 112.5 ~ "E",
tempDir <= 157.5 ~ "SE",
tempDir <= 202.5 ~ "S",
tempDir <= 247.5 ~ "SW",
tempDir <= 292.5 ~ "W",
tempDir <= 337.5 ~ "NW",
TRUE ~ "Error"
),
levels=c("Error", "000", "VRB", "NE", "E", "SE", "S", "SW", "W", "NW", "N")
)
)
## Warning in ifelse(is.na(WindDir) | WindDir %in% c("000", "VRB"), -1,
## as.numeric(WindDir)): NAs introduced by coercion
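The case_when() ladder above can also be expressed arithmetically: shifting degrees by 22.5 wraps the split north sector (337.5-360 plus 0-22.5) into a single bin. A base-R sketch (handling of exact sector-edge values may differ slightly from the case_when() version):

```r
# Bucket wind direction in degrees into one of eight compass sectors
dirGroup <- function(deg) {
  sectors <- c("N", "NE", "E", "SE", "S", "SW", "W", "NW")
  sectors[(floor((deg + 22.5) / 45) %% 8) + 1]
}
dirGroup(c(350, 10, 45, 180))  # "N" "N" "NE" "S"
```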
# Key factor variables include month, wType, predomDir
# Key numeric variables include WindSpeed, Altimeter, TempF, DewF, Visibility
fctNumList <- list(c("month", "WindSpeed"),
c("month", "Altimeter"),
c("month", "TempF"),
c("month", "DewF"),
c("month", "Visibility"),
c("wType", "WindSpeed"),
c("wType", "Altimeter"),
c("wType", "Visibility"),
c("predomDir", "WindSpeed"),
c("predomDir", "Altimeter"),
c("predomDir", "TempF"),
c("predomDir", "DewF")
)
for (x in fctNumList) {
plotFactorNumeric(mod2016Data,
fctVar=x[1],
numVar=x[2],
subT="Red dots are the overall average",
showXLabel=FALSE,
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE
)
}
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
The cross-locale comparisons bring out a few salient points:
WIND SPEED BY MONTH:
* The plot is too busy; need to rethink, potentially by removing the outliers and plotting only the box
ALTIMETER BY MONTH:
* Same as above; this plot is too busy
TEMPERATURE BY MONTH:
* Seasonal patterns are observed in most of the data, with a warm season centered around July and a cold season centered around December
* Las Vegas stands out for running warmer than the overall average in every month
* Houston, New Orleans, and San Diego stand out for running warmer than the overall average in the cold season and similar to the overall average in the warm season
DEW POINT BY MONTH:
* Seasonal patterns are observed in most of the data, with the humid seasons tracking with the warm seasons
* Houston and New Orleans run consistently above the average dew point by month
* San Diego runs above the average dew point during the cold season
* Las Vegas runs below the average dew point during the warm season
VISIBILITY BY MONTH:
* Visibilities are overwhelmingly likely to be 10 SM
* The Newark, NJ outlier described earlier (19 SM) should be deleted
* Detroit is especially likely to have Visibility less than 10 SM
* Grand Rapids and Traverse City have meaningful occurrences of Visibility less than 10 SM during the cold season
WIND SPEED BY SKY OBSCURATION:
* This chart is too busy as per above
ALTIMETER BY SKY OBSCURATION:
* Not much at a glance
VISIBILITY BY SKY OBSCURATION:
* The VV and OVC sky obscurations are most associated with low visibilities; VV in particular is almost always associated with very low visibility
WIND SPEED BY WIND DIRECTION:
* Wind direction “000” is always associated with wind speed 0, as expected
* Wind direction “VRB” is always associated with a low but non-zero wind speed, as expected
* While some cities are windier than others, there is no pronounced tendency for wind speed to be highly associated with a given wind direction in any locale
ALTIMETER BY WIND DIRECTION:
* No gross trends observed
* The plot is rather busy, especially given the low variance in median/IQR for altimeter relative to the outlier points
TEMPERATURE/DEW POINT BY WIND DIRECTION:
* Plot is not great for reading and interpreting
Next steps are to modify a few of the plots for better interpretability (less busy, more variation of the core data metric relative to the full y-axis, etc.), investigate and rectify the data issues observed, and save a version of the file for further analysis.
A modified boxplot function is created to plot only the median and the IQR, with the goal of having more of the variance in the median visible on the plot:
plotMedianIQR <- function(met,
fctVar,
numVar,
title=NULL,
subT="",
mid=0.5,
rng=c(0.25, 0.75),
diagnose=TRUE,
showXLabel=TRUE,
mapper=varMapper,
facetOn=NULL,
showCentral=FALSE,
ylimits=NULL
) {
# Function arguments
# met: dataframe or tibble containing raw data
# fctVar: character vector of variable to be used for the x-axis (factor in the boxplot)
# numVar: character vector of variable to be used for the y-axis (numeric in the boxplot)
# title: character vector for plot title
# subT: character vector for plot subtitle
# mid: float between 0 and 1 for the quantile to be used as the midpoint
# rng: length-two float vector for (lo, hi) to be used as the dimensions of the box
# diagnose: boolean for whether to note in the log the number of NA observations dropped
# showXLabel: boolean for whether to include the x-label (e.g., set to FALSE if using 'month')
# mapper: named list containing mapping from variable name to well-formatted name for titles and axes
# facetOn: a facetting variable for the supplied met (NULL for no faceting)
# showCentral: boolean for whether to show the central tendency over-plotted on the main data
# ylimits: length-two numeric for the y-axis minimum and maximum (default NULL uses plot defaults)
# Function usage
# 1. By default, the function creates a modified boxplot of numVar by fctVar - line at mid, box going from rng[1] to rng[2]
# 2. If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
# 3. If showCentral=TRUE, then the overall median of numVar by fctVar will be plotted as a red dot
# Check that the quantile variables are sensible
if (length(mid) != 1 | length(rng) != 2) {
stop("Must pass a single value as mid and a length-two vector as rng\n")
}
if (min(c(mid, rng)) < 0 | max(c(mid, rng)) > 1) {
stop("All values of mid and rng must be between 0 and 1, inclusive\n")
}
if ((mid < rng[1]) | (mid > rng[2])) {
stop("mid must be at least as big as rng[1] and no greater than rng[2]\n")
}
quants <- paste0(round(100*c(rng[1], mid, rng[2]), 0), "%", collapse=" ")
# Create the title if not passed
if (is.null(title)) {
title <- paste0("Hourly Observations of ", mapper[numVar], " by ", mapper[fctVar])
}
# Remove the NA variables
nOrig <- nrow(met)
dat <- met %>%
filter(!is.na(get(fctVar)), !is.na(get(numVar)))
if (diagnose) { cat("\nRemoving", nOrig-nrow(dat), "records due to NA\n") }
# Create the quantile data by fctVar and (if passed) facetOn
groupVars <- fctVar
if (!is.null(facetOn)) { groupVars <- c(groupVars, facetOn) }
plotData <- dat %>%
group_by_at(groupVars) %>%
summarize(midPoint=quantile(get(numVar), probs=mid),
loPoint=quantile(get(numVar), probs=rng[1]),
hiPoint=quantile(get(numVar), probs=rng[2])
)
# Create the base plot
p <- plotData %>%
ggplot(aes_string(x=fctVar, y="midPoint")) +
geom_crossbar(aes(ymin=loPoint, ymax=hiPoint), fill="lightblue") +
labs(title=title,
subtitle=subT,
x=ifelse(showXLabel, paste0(mapper[fctVar], " - ", fctVar), ""),
y=paste0(mapper[numVar], " - ", numVar),
caption=paste0("Quantiles plotted: ", quants)
)
# If facetting has been requested, facet by the desired variable
if (!is.null(facetOn)) {
p <- p + facet_wrap(as.formula(paste("~", facetOn)))
}
# If showCentral=TRUE, add a dot plot for the overall value of 'mid'
if (showCentral) {
centData <- helperFactorNumeric(dat, .f=quantile, byVar=fctVar, numVar=numVar, probs=mid)
p <- p + geom_point(data=centData, aes(y=helpFN), size=2, color="red")
}
# If ylim has been passed, use it
if (!is.null(ylimits)) {
p <- p + ylim(ylimits)
}
# Render the final plot
print(p)
}
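The core of plotMedianIQR() is a per-group quantile summary (loPoint, midPoint, hiPoint). A base-R sketch of that computation on toy data:

```r
dat <- data.frame(month = rep(c("Jan", "Jul"), each = 4),
                  TempF = c(20, 25, 30, 40, 65, 70, 75, 90))
# Quartiles by group -- the numbers plotMedianIQR() draws as the crossbar
q <- do.call(rbind, tapply(dat$TempF, dat$month,
                           quantile, probs = c(0.25, 0.5, 0.75)))
q
```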
The function can then be applied in an attempt to get a better look at a few of the comparisons:
# Key factor variables include month, wType, predomDir
# Key numeric variables include WindSpeed, Altimeter, TempF, DewF, Visibility
fctNumListIQR <- list(c("month", "WindSpeed"),
c("month", "Altimeter"),
c("wType", "WindSpeed"),
c("predomDir", "Altimeter"),
c("predomDir", "TempF"),
c("predomDir", "DewF")
)
for (x in fctNumListIQR) {
plotMedianIQR(mod2016Data,
fctVar=x[1],
numVar=x[2],
subT="Red dots are the overall mid-quantile",
showXLabel=FALSE,
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE
)
}
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
Limiting the observations to Q1-Median-Q3 brings out a bit more information in the plots:
WIND SPEED BY MONTH:
* San Diego and Las Vegas are meaningfully less windy than the median, especially during the cold season
ALTIMETER BY MONTH:
* San Diego has very little variance in altimeter relative to the other locales
* Las Vegas runs especially low on altimeter during the warm season
* There is a seasonal pattern where altimeters tend to run lower during the warm season while also showing a much smaller Q1-Q3 range
WIND SPEED BY SKY OBSCURATION:
* Newark shows very high wind speeds when obscuration is VV
* Should re-run with data y-axis capped at 15 to visualize the other locales
ALTIMETER BY WIND DIRECTION:
* Altimeters tend to be a bit lower when winds are from the S-SW-W-NW, a pattern that seems consistent across all locales
* Las Vegas runs especially low on altimeter relative to other locales when winds are from E-SE-S
TEMPERATURE BY WIND DIRECTION:
* Variable winds are associated with higher temperatures than calm (speed 0) winds
* Winds from W-NW are generally associated with somewhat lower temperatures
DEW POINT BY WIND DIRECTION:
* Winds from SW-W-NW-N are associated with generally lower dew points than winds from NE-E-SE-S
* Las Vegas typically runs low on dew points while Houston and New Orleans typically run high on dew points
* San Diego has little variance in dew points and does not see any dip when winds are from SW-W-NW-N
The wind speed vs. sky obscuration plot is re-run with y-limits that suppress Newark VV:
plotMedianIQR(mod2016Data,
fctVar="wType",
numVar="WindSpeed",
subT="Red dots are the overall mid-quantile",
showXLabel=FALSE,
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE,
ylimits=c(0, 15)
)
##
## Removing 593 records due to NA
## Warning: Removed 1 rows containing missing values (geom_crossbar).
Low vertical visibilities (VV) are generally associated with lower winds. There is also a slight tendency for more obscured skies (OVC) to be associated with slightly stronger winds.
Further, the previous analysis for counts by wind direction is re-run using the predominant directions:
# Counts by Metric for predomDir using mod2016Data
plotCountsByMetric(mod2016Data,
mets=c("predomDir"),
title="Comparisons Across Locales (red dots are the median)",
facetOn="sourceName",
showCentral=TRUE
)
Findings include:
* Across all locales, winds are more commonly from S-SW-W-NW-N than from NE-E-SE
* San Diego is especially likely to have either 0 wind or wind from W-NW
* Traverse City is especially likely to have 0 wind or variable wind (very surprising, and possibly a sign of anomalous data)
* Minneapolis is especially likely to experience winds from NE
* New Orleans and Houston rarely experience winds from SW-W-NW relative to other locales
* Lincoln and Las Vegas are especially prone to winds from S (and for Las Vegas SW)
There are several issues identified that should be explored and fixed if appropriate.
Cloud exploration can highlight whether all locales are being compared apples to apples. This is especially the case if some locales have a different maximum sensor height:
# Select the source, sourceName, dtime and cLevel variables; pivot cLevel down
cLevels <- mod2016Data %>%
select(source, sourceName, dtime, starts_with("cLevel")) %>%
pivot_longer(-c(source, sourceName, dtime), names_to="level", values_to="height") %>%
mutate(level=as.integer(str_replace(level, pattern="cLevel", replacement="")))
# Select the source, sourceName, dtime and cType variables; pivot cLevel down
cTypes <- mod2016Data %>%
select(source, sourceName, dtime, starts_with("cType")) %>%
pivot_longer(-c(source, sourceName, dtime), names_to="level", values_to="type") %>%
mutate(level=as.integer(str_replace(level, pattern="cType", replacement="")))
cData <- cLevels %>%
inner_join(cTypes, by=c("source", "sourceName", "dtime", "level"))
# Plot cloud heights, using only non-NA
cData %>%
filter(!is.na(height)) %>%
ggplot(aes(x=fct_reorder(sourceName, height, .fun=max, na.rm=TRUE), y=height)) +
geom_violin(fill="lightblue") +
coord_flip() +
labs(x="", y="Cloud Height (feet)", title="Density of cloud heights by locale")
There are clearly differences in the maximum cloud height recorded by locale.
The distribution of cloud types observed can also be assessed:
# Plot cloud heights, using only non-""
fctLayers <- c("VV", "OVC", "BKN", "SCT", "FEW")
cData %>%
filter(type!="") %>%
mutate(type=factor(type, levels=fctLayers)) %>%
ggplot(aes(x=fct_reorder(sourceName, height, .fun=max, na.rm=TRUE), fill=type)) +
geom_bar(position="stack") +
coord_flip() +
labs(x="", y="Cloud Layer Obscuration", title="Cloud obscuration by locale") +
scale_fill_discrete("", rev(fctLayers)) +
theme(legend.position="bottom")
Lincoln and Green Bay stand out for having fewer clouds, likely due to their sensors' inability to catch higher altocumulus and cirrus clouds (both very common in the mid-latitudes).
Supposing that only clouds of 12,000 feet and under are considered:
# Plot cloud heights, using only non-NA
cData %>%
filter(!is.na(height)) %>%
filter(height <= 12000) %>%
ggplot(aes(x=sourceName, y=height)) +
geom_violin(fill="lightblue") +
coord_flip() +
labs(x="", y="Cloud Height (feet)", title="Density of cloud heights by locale")
cData %>%
filter(type!="") %>%
filter(height <= 12000) %>%
mutate(type=factor(type, levels=fctLayers)) %>%
ggplot(aes(x=fct_reorder(sourceName, sourceName, .fun=length), fill=type)) +
geom_bar(position="stack") +
coord_flip() +
labs(x="", y="Cloud Layer Obscuration", title="Cloud obscuration by locale (up to 12,000 feet)") +
scale_fill_discrete("", rev(fctLayers)) +
theme(legend.position="bottom")
The patterns are much more plausible when the data are restricted to clouds at or below 12,000 feet.
Further investigation of the cloud data may be interesting.
A function is written to take only cloud data up through height x, and to add a layer of clouds that are “clear” at a height that is out-of-interval:
# Filter to only clouds up to and including height
cloudsLevel0 <- function(df,
maxHeight,
byVars,
baseLevel=0,
heightBase=-100,
typeBase="CLR"
) {
# Function assumptions
# Input data are unique by byVars-level and with columns 'height' and 'type'
# Clouds increase in height with level
# Clouds are non-decreasing in type (VV > OVC > BKN > SCT > FEW) with level
# FUNCTION ARGUMENTS:
# df: tibble or dataframe containing the clouds data
# maxHeight: the maximum height to consider (delete all heights above this level)
# byVars: the variables that make up a unique observation (df should be unique by byVars-level)
# baseLevel: the level to be created as the base level (by default, level 0)
# heightBase: the height to be provided to the base level (by default, -100 feet)
# typeBase: the type of obscuration observed at the base level (by default, CLR)
# Add a cloud level 0 that has height -100 (by default)
# Include only levels where the cloud height is not NA
# Include only levels where the cloud height is less than or equal to maxHeight
modData <- df %>%
group_by_at(vars(all_of(byVars))) %>%
summarize(level=baseLevel, height=heightBase, type=typeBase) %>%
ungroup() %>%
bind_rows(df) %>%
arrange_at(vars(all_of(c(byVars, "level")))) %>%
filter(!is.na(height)) %>%
filter(height <= maxHeight)
modData
}
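The base-row-plus-cutoff idea in cloudsLevel0() can be sketched in base R on a toy observation table:

```r
# Two toy observations: id 1 has layers at 2500 and 20000 ft, id 2 has none
obs  <- data.frame(id = c(1, 1, 2), level = c(1, 2, 1),
                   height = c(2500, 20000, NA), type = c("BKN", "OVC", ""))
# Sentinel base row per observation (level 0, height -100, type CLR)
base <- data.frame(id = unique(obs$id), level = 0, height = -100, type = "CLR")
mod  <- rbind(base, obs)
# Drop NA heights and layers above the cutoff, then order by id and level
mod  <- mod[!is.na(mod$height) & mod$height <= 12000, ]
mod  <- mod[order(mod$id, mod$level), ]
```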
The function is then run using the existing cData:
modCData <- cloudsLevel0(cData, maxHeight=12000, byVars=c("source", "sourceName", "dtime"))
modCData <- modCData %>%
mutate(type=factor(type, levels=c("VV", "OVC", "BKN", "SCT", "FEW", "CLR")))
modCData
## # A tibble: 260,324 x 6
## source sourceName dtime level height type
## <chr> <chr> <dttm> <dbl> <dbl> <fct>
## 1 kdtw_2016 Detroit, MI (2016) 2015-12-31 00:53:00 0 -100 CLR
## 2 kdtw_2016 Detroit, MI (2016) 2015-12-31 00:53:00 1 2500 BKN
## 3 kdtw_2016 Detroit, MI (2016) 2015-12-31 00:53:00 2 5000 OVC
## 4 kdtw_2016 Detroit, MI (2016) 2015-12-31 01:53:00 0 -100 CLR
## 5 kdtw_2016 Detroit, MI (2016) 2015-12-31 01:53:00 1 2100 OVC
## 6 kdtw_2016 Detroit, MI (2016) 2015-12-31 02:53:00 0 -100 CLR
## 7 kdtw_2016 Detroit, MI (2016) 2015-12-31 02:53:00 1 2100 OVC
## 8 kdtw_2016 Detroit, MI (2016) 2015-12-31 03:53:00 0 -100 CLR
## 9 kdtw_2016 Detroit, MI (2016) 2015-12-31 03:53:00 1 2500 OVC
## 10 kdtw_2016 Detroit, MI (2016) 2015-12-31 04:53:00 0 -100 CLR
## # ... with 260,314 more rows
An additional function is written to take processed cloud data and to designate 1) the minimum cloud height, 2) the minimum cloud ceiling (a ceiling exists with BKN, OVC, or VV layers), and 3) the maximum obscuration level (using CLR if no cloud layers exist in the height range considered):
# Helper function to pull out the minimum cloud height, limited to certain obscurations
getMinimumHeight <- function(df,
byVars,
types,
baseLevel=0
) {
# Split the data in to the baseLevel and all other levels
baseData <- df %>%
filter(level==baseLevel)
layerData <- df %>%
filter(level!=baseLevel)
# Take the layerData, limit to type in types, and find the minimum level
layerData <- layerData %>%
filter(type %in% types) %>%
group_by_at(vars(all_of(byVars))) %>%
filter(level==min(level)) %>%
ungroup()
# Put the data back together
# Keep the maximum level for each set of byVars (will be 0 if no data for byVars in layerData)
cloudData <- baseData %>%
bind_rows(layerData) %>%
arrange_at(vars(all_of(c(byVars, "level")))) %>%
group_by_at(vars(all_of(byVars))) %>%
filter(level==max(level)) %>%
ungroup()
cloudData
}
# Extract the minimum cloud height, minimum ceiling height, and maximum obscuration
hgtCeilObsc <- function(df,
byVars,
baseLevel=0
) {
# Function assumptions
# Input data are unique by byVars-level and with columns 'height' and 'type'
# For each byVars, a row with level=baseLevel, height=heightBase, type=typeBase has been created
# Clouds increase in height with level
# Clouds are non-decreasing in type (VV > OVC > BKN > SCT > FEW) with level
# FUNCTION ARGUMENTS:
# df: tibble or dataframe containing the clouds data
# byVars: the variables that make up a unique observation (df should be unique by byVars-level)
# baseLevel: the base level in df (by default, level 0; its height and type come from cloudsLevel0)
# Get the maximum obscuration
maxObsc <- df %>%
group_by_at(vars(all_of(byVars))) %>%
filter(level==max(level)) %>%
ungroup()
# Get the minimum height (any type)
minHeight <- getMinimumHeight(df,
byVars=byVars,
types=c("VV", "OVC", "BKN", "SCT", "FEW"),
baseLevel=baseLevel
)
# Get the minimum ceiling height (VV, OVC, BKN)
minCeiling <- getMinimumHeight(df,
byVars=byVars,
types=c("VV", "OVC", "BKN"),
baseLevel=baseLevel
)
# Put the file together
minCeiling <- minCeiling %>%
rename(ceilingHeight=height, ceilingType=type, ceilingLevel=level)
minHeight <- minHeight %>%
rename(cloudHeight=height, cloudType=type, cloudLevel=level)
maxObsc <- maxObsc %>%
rename(obscHeight=height, obscType=type, obscLevel=level)
# Merge
cloudSummary <- maxObsc %>%
full_join(minHeight, by=byVars) %>%
full_join(minCeiling, by=byVars)
cloudSummary
}
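The fallback logic in getMinimumHeight() (keep the lowest qualifying layer, else fall back to the base row) reduces to a small function; a base-R sketch:

```r
# Lowest height among layers whose type qualifies; the base value stands in
# for the level-0 CLR row when nothing qualifies
minWithBase <- function(heights, types, keep, base = -100) {
  h <- heights[types %in% keep]
  if (length(h) == 0) base else min(h)
}
ceilingTypes <- c("VV", "OVC", "BKN")
minWithBase(c(2500, 5000), c("FEW", "OVC"), keep = ceilingTypes)  # 5000
minWithBase(2500, "FEW", keep = ceilingTypes)                     # -100
```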
The function can then be run on the modified clouds data:
# Get the key clouds data
cloudSummary <- hgtCeilObsc(modCData, byVars=c("source", "sourceName", "dtime"))
# Check for consistency
cloudSummary %>%
count(ceilingType, obscType)
## # A tibble: 7 x 3
## ceilingType obscType n
## <fct> <fct> <int>
## 1 VV VV 761
## 2 OVC OVC 22654
## 3 BKN OVC 10054
## 4 BKN BKN 17198
## 5 CLR SCT 12456
## 6 CLR FEW 22711
## 7 CLR CLR 46309
cloudSummary %>%
count(cloudType, ceilingType)
## # A tibble: 10 x 3
## cloudType ceilingType n
## <fct> <fct> <int>
## 1 VV VV 761
## 2 OVC OVC 17900
## 3 BKN BKN 15436
## 4 SCT OVC 1848
## 5 SCT BKN 4936
## 6 SCT CLR 8243
## 7 FEW OVC 2906
## 8 FEW BKN 6880
## 9 FEW CLR 26924
## 10 CLR CLR 46309
cloudSummary %>%
count(cloudType, obscType)
## # A tibble: 12 x 3
## cloudType obscType n
## <fct> <fct> <int>
## 1 VV VV 761
## 2 OVC OVC 17900
## 3 BKN OVC 6384
## 4 BKN BKN 9052
## 5 SCT OVC 3444
## 6 SCT BKN 3340
## 7 SCT SCT 8243
## 8 FEW OVC 4980
## 9 FEW BKN 4806
## 10 FEW SCT 4213
## 11 FEW FEW 22711
## 12 CLR CLR 46309
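By construction (cloud types are non-decreasing with level), the maximum-obscuration type should never be less obscured than the ceiling type or the minimum-height type. That consistency can also be checked programmatically; the sketch below uses a toy stand-in for the summary table and an assumed least-to-most-obscured ranking of the types:

```r
library(dplyr)

# Assumed obscuration ranking, least to most obscured
typeOrder <- c("CLR", "FEW", "SCT", "BKN", "OVC", "VV")
typeRank <- function(x) match(as.character(x), typeOrder)

# Toy stand-in for cloudSummary (the real check would use cloudSummary itself)
toySummary <- tibble::tibble(cloudType   = c("FEW", "SCT", "CLR"),
                             ceilingType = c("BKN", "CLR", "CLR"),
                             obscType    = c("OVC", "SCT", "CLR"))

# Rows where the maximum obscuration is less obscured than either summary type
violations <- toySummary %>%
  filter(typeRank(obscType) < typeRank(ceilingType) |
           typeRank(obscType) < typeRank(cloudType))
nrow(violations)  # 0 expected for consistent data
```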
Plots for the maximum obscuration (through 12,000 feet) can then be created:
plotMaxObsc <- function(df,
xVar,
fillVar,
title,
subtitle="Up to and including 12,000 feet",
orderByVariable=NULL,
orderByValue=NULL,
posnBar="stack",
yLabel="# Hourly Observations",
legendLabel="",
facetOn=NULL
) {
# Get the levels to be used
cLevels <- levels(df %>% pull(fillVar))
# Create the main plot
p1 <- df %>%
ggplot(aes_string(fill=fillVar))
if (!is.null(orderByVariable)) {
p1 <- p1 +
geom_bar(aes(x=fct_reorder(get(xVar), get(orderByVariable)==orderByValue, .fun=sum)),
position=posnBar
)
} else {
p1 <- p1 +
geom_bar(aes_string(x=xVar), position=posnBar)
}
p1 <- p1 +
coord_flip() +
labs(x="",
y=yLabel,
title=title,
subtitle=subtitle
) +
theme(legend.position="bottom") +
scale_fill_discrete(legendLabel, rev(cLevels)) +
guides(fill=guide_legend(nrow=1))
if (!is.null(facetOn)) {
p1 <- p1 + facet_wrap(as.formula(paste("~", facetOn)))
}
print(p1)
}
# Cloud obscuration by source
plotMaxObsc(cloudSummary,
xVar="sourceName",
fillVar="obscType",
title="Maximum Cloud Obscuration",
orderByVariable="obscType",
orderByValue="CLR"
)
# Cloud obscuration by month
cloudSummary <- cloudSummary %>%
mutate(month=lubridate::month(dtime),
hour=lubridate::hour(dtime),
monthfct=factor(month.abb[month], levels=month.abb[1:12])
)
plotMaxObsc(cloudSummary,
xVar="monthfct",
fillVar="obscType",
title="Maximum Cloud Obscuration",
posnBar="fill"
)
A few salient observations stand out about the maximum obscuration level:
Groups of cities can be examined, faceted by month:
cityCloudList <- list(c("klas_2016", "ksan_2016", "kiah_2016", "kmsy_2016"),
c("kgrb_2016", "kgrr_2016", "kdtw_2016", "ktvc_2016"),
c("klnk_2016", "kmsp_2016", "kmsn_2016", "kind_2016"),
c("kmke_2016", "kord_2016", "kewr_2016")
)
for (x in cityCloudList) {
cloudUse <- cloudSummary %>%
filter(source %in% x)
plotMaxObsc(cloudUse,
xVar="monthfct",
fillVar="obscType",
title="Maximum Cloud Obscuration",
facetOn="sourceName",
posnBar="fill"
)
}
A few findings include:
The ceiling heights can also be assessed:
cloudSummary <- cloudSummary %>%
mutate(ceilFactor=factor(case_when(ceilingHeight == -100 ~ "None",
ceilingHeight <= 1000 ~ "0-1000",
ceilingHeight <= 3000 ~ "1000-3000",
ceilingHeight <= 6000 ~ "3000-6000",
ceilingHeight <= 12000 ~ "6000-12000"
),
levels=c("None", "6000-12000", "3000-6000", "1000-3000", "0-1000")
)
)
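The bucketing above relies on case_when() evaluating its conditions in order, so each height lands in the first matching bucket and -100 (the sentinel for no ceiling) is caught before the numeric ranges. A quick self-contained check of that logic, using a hypothetical bucketHeight helper with the same cut points:

```r
library(dplyr)

# Hypothetical helper with the same cut points as ceilFactor above;
# case_when returns the first condition that matches, top to bottom
bucketHeight <- function(h) {
  case_when(h == -100  ~ "None",
            h <= 1000  ~ "0-1000",
            h <= 3000  ~ "1000-3000",
            h <= 6000  ~ "3000-6000",
            h <= 12000 ~ "6000-12000")
}
bucketHeight(c(-100, 500, 2500, 11999))
# "None" "0-1000" "1000-3000" "6000-12000"
```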
plotMaxObsc(cloudSummary,
xVar="sourceName",
fillVar="ceilFactor",
title="Ceiling Height",
orderByVariable="ceilFactor",
orderByValue="None"
)
for (x in cityCloudList) {
cloudUse <- cloudSummary %>%
filter(source %in% x)
plotMaxObsc(cloudUse,
xVar="monthfct",
fillVar="ceilFactor",
title="Ceiling Height (proportion of hourly observations)",
facetOn="sourceName",
posnBar="fill",
yLabel="",
legendLabel="Ceiling Height (feet)"
)
}
Findings for ceiling height broadly line up with findings for maximum cloud obscuration, as expected:
The same analysis can be run for the minimum cloud height:
cloudSummary <- cloudSummary %>%
mutate(minCFactor=factor(case_when(cloudHeight == -100 ~ "None",
cloudHeight <= 1000 ~ "0-1000",
cloudHeight <= 3000 ~ "1000-3000",
cloudHeight <= 6000 ~ "3000-6000",
cloudHeight <= 12000 ~ "6000-12000"
),
levels=c("None", "6000-12000", "3000-6000", "1000-3000", "0-1000")
)
)
plotMaxObsc(cloudSummary,
xVar="sourceName",
fillVar="minCFactor",
title="Minimum Cloud Height",
orderByVariable="minCFactor",
orderByValue="None"
)
for (x in cityCloudList) {
cloudUse <- cloudSummary %>%
filter(source %in% x)
plotMaxObsc(cloudUse,
xVar="monthfct",
fillVar="minCFactor",
title="Minimum Cloud Height (proportion of hourly observations)",
facetOn="sourceName",
posnBar="fill",
yLabel="",
legendLabel="Minimum Cloud Height (feet)"
)
}
At a glance, there seem to be some similarities and differences among the locales:
A more analytical approach would look at the distance between the various locales. Broadly speaking, distance can be calculated based on counts by month by locale. No scaling is performed, since each location has roughly the same number of observations by month and the intent is to find macro similarities (e.g., if one city has 2% of its data in bucket x and the others all have 0%, this analysis treats that as a minor difference, whereas with scaled data it would be a primary driver of the distance).
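The scaling point can be illustrated with a toy counts matrix (values chosen for illustration): with raw counts, a near-empty bucket contributes little to the Euclidean distance, but after scale() every column is stretched to unit variance and the sparse bucket can dominate:

```r
# Three locales (rows) with counts in two buckets (columns);
# bucket 2 is near-empty everywhere
m <- rbind(a = c(100, 2),
           b = c(100, 0),
           c = c( 98, 0))

dist(m)         # raw counts: a and b differ only via the sparse bucket (distance 2)
dist(scale(m))  # scaled: the sparse bucket is inflated to unit variance
```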
There are three main variables to consider for distance:
Distances can be calculated using any or all of these variables. A function is created so that the process can be repeated:
findCloudDist <- function(df,
byVar,
fctVar,
pivotVar="monthfct",
scaleDistData=FALSE,
returnPivotOnly=FALSE
) {
# FUNCTION ARGUMENTS
# df: data frame or tibble containing one record per locale and time
# byVar: the variable(s) by which the final distance file should be unique
# fctVar: the factor variables to be counted up, with the sums being the distance inputs
# pivotVar: the variable(s) which should be pivoted into columns to make the file unique by byVar
# scaleDistData: whether to scale the data prior to calculating distances
# returnPivotOnly: whether to return only the pivoted counts data frame; the default (FALSE) instead returns the distances calculated from that pivoted data frame
# Create the counts data by byVar, pivoted by fctVar and pivotVar
baseData <- df %>%
group_by_at(vars(all_of(c(byVar, fctVar, pivotVar)))) %>%
summarize(n=n()) %>%
ungroup() %>%
pivot_wider(names_from=all_of(c(fctVar, pivotVar)), values_from="n") %>%
mutate_if(is.numeric, tidyr::replace_na, replace=0)
# If the data are only to be pivoted, return and exit the function
if (returnPivotOnly) {
return(baseData)
}
# Split into descriptors and data
descData <- baseData %>%
select_at(vars(all_of(byVar))) %>%
mutate(rowN=row_number())
distData <- baseData %>%
select_at(vars(-any_of(byVar))) %>%
as.matrix()
# Scale distdata if requested
if (scaleDistData) {
distData <- scale(distData)
}
# Create the distances, convert back to data frame, pivot_longer, and attach the labels
dist(distData) %>%
as.matrix() %>%
as_tibble() %>%
mutate(row1=row_number()) %>%
pivot_longer(-row1, names_to="row2", values_to="dist") %>%
mutate(row2=as.integer(row2)) %>%
inner_join(descData, by=c("row1"="rowN")) %>%
rename_at(vars(all_of(byVar)), ~paste0(., "_1")) %>%
inner_join(descData, by=c("row2"="rowN")) %>%
rename_at(vars(all_of(byVar)), ~paste0(., "_2"))
}
The function is then run for the obscuration data as an example:
# Find the distances for obscType by locale vs. locale
obscDist <- findCloudDist(cloudSummary,
byVar=c("source", "sourceName"),
fctVar="obscType",
pivotVar="monthfct"
)
obscDist
## # A tibble: 225 x 7
## row1 row2 dist source_1 sourceName_1 source_2 sourceName_2
## <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 1 1 0 kdtw_2016 Detroit, MI (2016) kdtw_20~ Detroit, MI (2016)
## 2 1 2 627. kdtw_2016 Detroit, MI (2016) kewr_20~ Newark, NJ (2016)
## 3 1 3 669. kdtw_2016 Detroit, MI (2016) kgrb_20~ Green Bay, WI (2016)
## 4 1 4 294. kdtw_2016 Detroit, MI (2016) kgrr_20~ Grand Rapids, MI (20~
## 5 1 5 726. kdtw_2016 Detroit, MI (2016) kiah_20~ Houston, TX (2016)
## 6 1 6 347. kdtw_2016 Detroit, MI (2016) kind_20~ Indianapolis, IN (20~
## 7 1 7 1421. kdtw_2016 Detroit, MI (2016) klas_20~ Las Vegas, NV (2016)
## 8 1 8 1092. kdtw_2016 Detroit, MI (2016) klnk_20~ Lincoln, NE (2016)
## 9 1 9 361. kdtw_2016 Detroit, MI (2016) kmke_20~ Milwaukee, WI (2016)
## 10 1 10 345. kdtw_2016 Detroit, MI (2016) kmsn_20~ Madison, WI (2016)
## # ... with 215 more rows
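The nearest neighbour for each locale can also be read directly off this long-format output. The helper below is hypothetical (it is not part of the original module) and mirrors the min-distance logic used for ordering in the plotting function that follows; it would be called as nearestLocale(obscDist):

```r
library(dplyr)

# Hypothetical helper: for each locale, keep the closest other locale
nearestLocale <- function(distDf) {
  distDf %>%
    filter(sourceName_1 != sourceName_2) %>%   # drop distance-to-self rows
    group_by(sourceName_1) %>%
    filter(dist == min(dist)) %>%
    ungroup() %>%
    select(sourceName_1, sourceName_2, dist)
}
# e.g. nearestLocale(obscDist)
```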
A function can then be created to plot the distances as a heatmap, with locales ordered by their minimum distance to any other locale:
plotCloudDist <- function(df,
var1="sourceName_1",
var2="sourceName_2",
met="dist",
roundDist=0,
title="Distance Between Locales",
subT=""
) {
# FUNCTION ARGUMENTS:
# df: tibble or data frame containing distance data
# var1: the variable containing the first locale
# var2: the variable containing the second locale
# met: the variable containing the pre-calculated distance between var1 and var2
# roundDist: the rounding for the distance labels in the plot
# title, subT: the plot title and subtitle
# Process the data frame and exclude any occurrences of distance to self
distData <- df %>%
select_at(vars(all_of(c(var1, var2, met)))) %>%
# Standardize the metric column name so the code below can refer to it as 'dist'
rename_at(vars(all_of(met)), ~"dist") %>%
filter(get(var1) != get(var2))
# Get the locales by minimum distance to any other locale
distHiLo <- distData %>%
group_by_at(vars(all_of(var1))) %>%
filter(dist==min(dist)) %>%
arrange(dist) %>%
pull(var1)
# Create a heatmap of the distances
distData %>%
ggplot(aes(x=factor(get(var1), levels=distHiLo), y=factor(get(var2), levels=distHiLo))) +
geom_tile(aes(fill=dist)) +
geom_text(aes(label=round(dist, roundDist)), color="lightblue") +
labs(x="", y="", title=title, subtitle=subT) +
scale_fill_gradient("Distance", low="black", high="white") +
theme(axis.text.x=element_text(angle=90))
}
The process can then be run on the obscuration data as a check:
plotCloudDist(obscDist, subT="Based on % of obscuration type by month")
FINDINGS FOR OBSCURATION DISTANCE:
This suggests a cold-weather cluster, a hot-and-humid cluster, and several locales that are more or less standalone but could be grouped loosely with each other.
The functions can then be run on variations of the data:
# Run for minCFactor
findCloudDist(cloudSummary,
byVar=c("source", "sourceName"),
fctVar="minCFactor",
pivotVar="monthfct"
) %>%
plotCloudDist(subT="Based on % in each minimum cloud height bucket by month")
# Run for ceilFactor
findCloudDist(cloudSummary,
byVar=c("source", "sourceName"),
fctVar="ceilFactor",
pivotVar="monthfct"
) %>%
plotCloudDist(subT="Based on % in each ceiling height bucket by month")
Findings by minimum cloud height are similar to findings by obscuration.
Findings by ceiling height show that Lincoln and Newark are close to each other and that San Diego and Las Vegas each form a standalone segment of one. Traverse City and Grand Rapids are close to each other, as are New Orleans and Houston. There continues to be a large, tight segment of cold-weather cities.
Of interest is whether a simple kmeans analysis returns the same findings:
set.seed(2006040940)
ceilDistData <- findCloudDist(cloudSummary,
byVar=c("source", "sourceName"),
fctVar="ceilFactor",
pivotVar="monthfct",
returnPivotOnly=TRUE
)
tibble::tibble(locale=ceilDistData$sourceName,
cluster=kmeans(dist(ceilDistData[3:ncol(ceilDistData)]), centers=5, nstart=1000)$cluster
) %>%
arrange(-cluster)
## # A tibble: 15 x 2
## locale cluster
## <chr> <int>
## 1 Grand Rapids, MI (2016) 5
## 2 Traverse City, MI (2016) 5
## 3 Las Vegas, NV (2016) 4
## 4 Newark, NJ (2016) 3
## 5 Houston, TX (2016) 3
## 6 Lincoln, NE (2016) 3
## 7 New Orleans, LA (2016) 3
## 8 Detroit, MI (2016) 2
## 9 Green Bay, WI (2016) 2
## 10 Indianapolis, IN (2016) 2
## 11 Milwaukee, WI (2016) 2
## 12 Madison, WI (2016) 2
## 13 Minneapolis, MN (2016) 2
## 14 Chicago, IL (2016) 2
## 15 San Diego, CA (2016) 1
The five-segment k-means is consistent - solo segments for San Diego and Las Vegas; a Grand Rapids-Traverse City segment (both cities are on the downwind side of Lake Michigan and would be expected to see much different clouds than cities with similar temperatures); a cold-weather midwestern segment; and a catch-all segment.
The data can also be assessed using hierarchical clustering:
hclust(dist(ceilDistData[3:ncol(ceilDistData)]), method="complete") %>%
plot(labels=ceilDistData$sourceName, cex=0.5, main="Hierarchical on Ceiling Height: method=complete")
hclust(dist(ceilDistData[3:ncol(ceilDistData)]), method="single") %>%
plot(labels=ceilDistData$sourceName, cex=0.5, main="Hierarchical on Ceiling Height: method=single")
Similar conclusions can be drawn from hierarchical clustering based on ceiling height buckets by month:
This is suggestive of either 6 or 7 segments in this data. Does k-means match up?
tibble::tibble(locale=ceilDistData$sourceName,
cluster=kmeans(dist(ceilDistData[3:ncol(ceilDistData)]), centers=6, nstart=1000)$cluster
) %>%
arrange(-cluster)
## # A tibble: 15 x 2
## locale cluster
## <chr> <int>
## 1 Las Vegas, NV (2016) 6
## 2 San Diego, CA (2016) 5
## 3 Newark, NJ (2016) 4
## 4 Lincoln, NE (2016) 4
## 5 Houston, TX (2016) 3
## 6 New Orleans, LA (2016) 3
## 7 Grand Rapids, MI (2016) 2
## 8 Traverse City, MI (2016) 2
## 9 Detroit, MI (2016) 1
## 10 Green Bay, WI (2016) 1
## 11 Indianapolis, IN (2016) 1
## 12 Milwaukee, WI (2016) 1
## 13 Madison, WI (2016) 1
## 14 Minneapolis, MN (2016) 1
## 15 Chicago, IL (2016) 1
tibble::tibble(locale=ceilDistData$sourceName,
cluster=kmeans(dist(ceilDistData[3:ncol(ceilDistData)]), centers=7, nstart=1000)$cluster
) %>%
arrange(-cluster)
## # A tibble: 15 x 2
## locale cluster
## <chr> <int>
## 1 Houston, TX (2016) 7
## 2 New Orleans, LA (2016) 7
## 3 Detroit, MI (2016) 6
## 4 Green Bay, WI (2016) 6
## 5 Minneapolis, MN (2016) 6
## 6 Newark, NJ (2016) 5
## 7 Lincoln, NE (2016) 5
## 8 Grand Rapids, MI (2016) 4
## 9 Traverse City, MI (2016) 4
## 10 Indianapolis, IN (2016) 3
## 11 Milwaukee, WI (2016) 3
## 12 Madison, WI (2016) 3
## 13 Chicago, IL (2016) 3
## 14 San Diego, CA (2016) 2
## 15 Las Vegas, NV (2016) 1
Very nicely aligned. The addition of the sixth segment splits the catch-all segment (from the previous five-cluster analysis) into a Houston-New Orleans segment and a Lincoln-Newark segment. The addition of the seventh segment splits the cold-weather cities into Chicago-Madison-Milwaukee-Indianapolis and Detroit-Green Bay-Minneapolis.
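The hierarchical result can be compared with k-means directly by cutting the dendrogram into the same number of segments; cutree() returns explicit cluster labels (a sketch, assuming ceilDistData as built above):

```r
# Cut the 'complete' dendrogram into 7 segments and label the locales
hc <- hclust(dist(ceilDistData[3:ncol(ceilDistData)]), method="complete")
tibble::tibble(locale=ceilDistData$sourceName, cluster=cutree(hc, k=7)) %>%
  dplyr::arrange(cluster)
```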
Are any differences observed in trends for minimum cloud height?
heightDistData <- findCloudDist(cloudSummary,
byVar=c("source", "sourceName"),
fctVar="minCFactor",
pivotVar="monthfct",
returnPivotOnly=TRUE
)
hclust(dist(heightDistData[3:ncol(heightDistData)]), method="complete") %>%
plot(labels=heightDistData$sourceName, cex=0.5,
main="Hierarchical on Minimum Cloud Height: method=complete"
)
hclust(dist(heightDistData[3:ncol(heightDistData)]), method="single") %>%
plot(labels=heightDistData$sourceName, cex=0.5,
main="Hierarchical on Minimum Cloud Height: method=single"
)
There are some differences when using the ‘complete’ linkage method:
When using the ‘single’ linkage method, these patterns become muted:
We would expect k-means to pull out 1-2 cold-weather clusters and a Houston-New Orleans cluster:
set.seed(2006041003)
tibble::tibble(locale=heightDistData$sourceName,
cluster=kmeans(dist(heightDistData[3:ncol(heightDistData)]), centers=5, nstart=1000)$cluster
) %>%
arrange(-cluster)
## # A tibble: 15 x 2
## locale cluster
## <chr> <int>
## 1 Las Vegas, NV (2016) 5
## 2 Lincoln, NE (2016) 5
## 3 Houston, TX (2016) 4
## 4 New Orleans, LA (2016) 4
## 5 San Diego, CA (2016) 3
## 6 Detroit, MI (2016) 2
## 7 Grand Rapids, MI (2016) 2
## 8 Indianapolis, IN (2016) 2
## 9 Milwaukee, WI (2016) 2
## 10 Madison, WI (2016) 2
## 11 Minneapolis, MN (2016) 2
## 12 Chicago, IL (2016) 2
## 13 Traverse City, MI (2016) 2
## 14 Newark, NJ (2016) 1
## 15 Green Bay, WI (2016) 1
tibble::tibble(locale=heightDistData$sourceName,
cluster=kmeans(dist(heightDistData[3:ncol(heightDistData)]), centers=6, nstart=1000)$cluster
) %>%
arrange(-cluster)
## # A tibble: 15 x 2
## locale cluster
## <chr> <int>
## 1 Houston, TX (2016) 6
## 2 New Orleans, LA (2016) 6
## 3 Newark, NJ (2016) 5
## 4 Green Bay, WI (2016) 5
## 5 San Diego, CA (2016) 4
## 6 Detroit, MI (2016) 3
## 7 Grand Rapids, MI (2016) 3
## 8 Indianapolis, IN (2016) 3
## 9 Milwaukee, WI (2016) 3
## 10 Madison, WI (2016) 3
## 11 Minneapolis, MN (2016) 3
## 12 Chicago, IL (2016) 3
## 13 Traverse City, MI (2016) 3
## 14 Lincoln, NE (2016) 2
## 15 Las Vegas, NV (2016) 1
tibble::tibble(locale=heightDistData$sourceName,
cluster=kmeans(dist(heightDistData[3:ncol(heightDistData)]), centers=7, nstart=1000)$cluster
) %>%
arrange(-cluster)
## # A tibble: 15 x 2
## locale cluster
## <chr> <int>
## 1 Indianapolis, IN (2016) 7
## 2 Milwaukee, WI (2016) 7
## 3 Madison, WI (2016) 7
## 4 Minneapolis, MN (2016) 7
## 5 Chicago, IL (2016) 7
## 6 Lincoln, NE (2016) 6
## 7 Newark, NJ (2016) 5
## 8 Green Bay, WI (2016) 5
## 9 Las Vegas, NV (2016) 4
## 10 Detroit, MI (2016) 3
## 11 Grand Rapids, MI (2016) 3
## 12 Traverse City, MI (2016) 3
## 13 Houston, TX (2016) 2
## 14 New Orleans, LA (2016) 2
## 15 San Diego, CA (2016) 1
tibble::tibble(locale=heightDistData$sourceName,
cluster=kmeans(dist(heightDistData[3:ncol(heightDistData)]), centers=8, nstart=1000)$cluster
) %>%
arrange(-cluster)
## # A tibble: 15 x 2
## locale cluster
## <chr> <int>
## 1 San Diego, CA (2016) 8
## 2 Indianapolis, IN (2016) 7
## 3 Milwaukee, WI (2016) 7
## 4 Madison, WI (2016) 7
## 5 Minneapolis, MN (2016) 7
## 6 Chicago, IL (2016) 7
## 7 Houston, TX (2016) 6
## 8 New Orleans, LA (2016) 6
## 9 Las Vegas, NV (2016) 5
## 10 Detroit, MI (2016) 4
## 11 Grand Rapids, MI (2016) 4
## 12 Traverse City, MI (2016) 4
## 13 Newark, NJ (2016) 3
## 14 Green Bay, WI (2016) 2
## 15 Lincoln, NE (2016) 1
With five segments, there is:
Adding the sixth cluster splits apart Las Vegas and Lincoln. Adding the seventh cluster cleaves Grand Rapids-Traverse City-Detroit from the other cold-weather cities. Adding the eighth cluster splits apart Green Bay and Newark.
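To help justify the choice among five to eight segments, the total within-cluster sum of squares can be scanned across k (the standard elbow heuristic; a sketch assuming heightDistData as built above):

```r
set.seed(2006041003)
# Total within-cluster sum of squares for k = 2..10; look for the 'elbow'
# where additional clusters stop buying much reduction
sapply(2:10, function(k) {
  kmeans(dist(heightDistData[3:ncol(heightDistData)]), centers=k, nstart=1000)$tot.withinss
})
```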
Next steps are to explore the precipitation data.